Abstract

Multi-armed bandit problems are the most basic examples of sequential 
decision problems with an exploration-exploitation trade-off. This is 
the balance between staying with the option that gave highest payoffs 
in the past and exploring new options that might give higher payoffs 
in the future. Although the study of bandit problems dates back to 
the Thirties, exploration-exploitation trade-offs arise in several modern 
applications, such as ad placement, website optimization, and packet 
routing. Mathematically, a multi-armed bandit is defined by the payoff 
process associated with each option. In this survey, we focus on two 
extreme cases in which the analysis of regret is particularly simple and 
elegant: i.i.d. payoffs and adversarial payoffs. Besides the basic setting 
of finitely many actions, we also analyze some of the most important 
variants and extensions, such as the contextual bandit model. 



Contents 



1 Introduction

2 Stochastic bandits: fundamental results

2.1 Optimism in face of uncertainty

2.2 Upper Confidence Bound (UCB) strategies

2.3 Lower bound

2.4 Refinements and bibliographic remarks

3 Adversarial bandits: fundamental results

3.1 Pseudo-regret bounds

3.2 High probability and expected regret bounds

3.3 Lower Bound

3.4 Refinements and bibliographic remarks

4 Contextual bandits

4.1 Bandits with side information



4.2 The expert case 

4.3 Stochastic contextual bandits 

4.4 The multiclass case 

4.5 Bibliographic remarks 

5 Linear bandits 

5.1 Exp2 (Expanded Exp) with John's exploration 

5.2 Online Mirror Descent (OMD) 

5.3 Online Stochastic Mirror Descent (OSMD) 

5.4 Online combinatorial optimization 

5.5 Improved regret bounds for bandit feedback 

5.6 Refinements and bibliographic remarks 

6 Nonlinear bandits 

6.1 Two-points bandit feedback 

6.2 One-point bandit feedback 

6.3 Nonlinear stochastic bandits 

6.4 Bibliographic remarks 

7 Variants 

7.1 Markov Decision Processes, restless and sleeping bandits 

7.2 Pure exploration problems 

7.3 Dueling bandits 

7.4 Discovery with probabilistic expert advice 

7.5 Many-armed bandits 

7.6 Truthful bandits 

7.7 Concluding remarks 



Acknowledgements 



1 Introduction



A multi-armed bandit problem (or, simply, a bandit problem) is a sequential
allocation problem defined by a set of actions. At each time
step, a unit resource is allocated to an action and some observable 
payoff is obtained. The goal is to maximize the total payoff obtained 
in a sequence of allocations. The name bandit refers to the colloquial 
term for a slot machine ("one-armed bandit" in American slang). In a 
casino, a sequential allocation problem is obtained when the player is 
facing many slot machines at once (a "multi-armed bandit"), and must 
repeatedly choose where to insert the next coin. 

Bandit problems are basic instances of sequential decision making 
with limited information, and naturally address the fundamental trade- 
off between exploration and exploitation in sequential experiments. In- 
deed, the player must balance the exploitation of actions that did well 
in the past and the exploration of actions that might give higher payoffs 
in the future. 

Although the original motivation of Thompson [1933] for studying
bandit problems came from clinical trials (when different treatments
are available for a certain disease and one must decide which treatment
to use on the next patient), modern technologies have created
many opportunities for new applications, and bandit problems now
play an important role in several industrial domains. In particular, on- 
line services are natural targets for bandit algorithms, because there 
one can benefit from adapting the service to the individual sequence of 
requests. We now describe a few concrete examples in various domains. 

Ad placement is the problem of deciding which advertisement to 
display on the web page delivered to the next visitor of a website. 
Similarly, website optimization deals with the problem of sequentially 
choosing design elements (font, images, layout) for the web page. Here 
the payoff is associated with visitor's actions, e.g., clickthroughs or 
other desired behaviors. Of course there are important differences with 
the basic bandit problem: in ad placement the pool of available ads 
(bandit arms) may change over time, and there might be a limit on the 
number of times each ad could be displayed. 

In source routing, a sequence of packets must be routed from a source
host to a destination host in a given network, and the protocol allows one to
choose a specific source-destination path for each packet to be sent. The
(negative) payoff is the time it takes to deliver a packet, and depends 
additively on the congestion of the edges in the chosen path. 

In computer game-playing, each move is chosen by simulating and 
evaluating many possible game continuations after the move. Algo- 
rithms for bandits (more specifically, for a tree-based version of the 
bandit problem) can be used to explore more efficiently the huge tree 
of game continuations by focusing on the most promising subtrees. 
This idea has been successfully implemented in the MoGo player of
Gelly et al. [2006], which plays Go at world-class level. MoGo is based
on the UCT strategy for hierarchical bandits of Kocsis and Szepesvári
[2006], which is in turn derived from the UCB bandit algorithm (see
Chapter 2).

There are three fundamental formalizations of the bandit problem 
depending on the assumed nature of the reward process: stochastic, ad- 
versarial, and Markovian. Three distinct playing strategies have been 
shown to effectively address each specific bandit model: the UCB al- 
gorithm in the stochastic case, the Exp3 randomized algorithm in the 
adversarial case, and the so-called Gittins indices in the Markovian 
case. In this survey, we focus on stochastic and adversarial bandits, 
and refer the reader to the survey by Mahajan and Teneketzis [2008]
or to the recent monograph by Gittins et al. [2011] for an extensive
analysis of Markovian bandits. 

In order to analyze the behavior of a player or forecaster (i.e., the 
agent implementing a bandit strategy), we may compare its perfor- 
mance with that of an optimal strategy that, for any horizon of n time 
steps, consistently plays the arm that is best in the first n steps. In 
other terms, we may study the regret of the forecaster for not always
playing optimally. More specifically, given $K \ge 2$ arms and given
sequences $X_{i,1}, X_{i,2}, \dots$ of unknown rewards associated with each arm
$i = 1, \dots, K$, we study forecasters that at each time step $t = 1, 2, \dots$
select an arm $I_t$ and receive the associated reward $X_{I_t,t}$. The regret
after $n$ plays $I_1, \dots, I_n$ is defined by

$$R_n = \max_{i=1,\dots,K} \sum_{t=1}^n X_{i,t} - \sum_{t=1}^n X_{I_t,t} . \tag{1.1}$$

If the time horizon is not known in advance we say that the forecaster 
is anytime. 
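
As a minimal sketch of definition (1.1), the regret can be computed directly from a table of realized rewards. The function name `regret`, the 0-based arm indices, and the toy reward table below are illustrative assumptions, not part of the survey.

```python
from typing import Sequence


def regret(rewards: Sequence[Sequence[float]], plays: Sequence[int]) -> float:
    """Regret (1.1): total of the best single arm in hindsight minus the forecaster's total.

    rewards[i][t] is the reward X_{i,t+1} of arm i (0-based indices);
    plays[t] is the arm I_{t+1} chosen by the forecaster.
    """
    n = len(plays)
    best_fixed_arm_total = max(sum(arm_rewards[:n]) for arm_rewards in rewards)
    forecaster_total = sum(rewards[arm][t] for t, arm in enumerate(plays))
    return best_fixed_arm_total - forecaster_total


# Toy table with K = 2 arms and n = 4 rounds (made-up numbers).
X = [[1.0, 0.0, 1.0, 1.0],
     [0.0, 1.0, 0.0, 0.0]]
I = [1, 0, 0, 1]
print(regret(X, I))
```

Note that the comparator is a single fixed arm, not the best arm round by round: a forecaster can have zero regret while still missing the largest reward in many individual rounds.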

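In general, both the rewards $X_{i,t}$ and the forecaster's choices $I_t$ might be
stochastic. This allows one to distinguish between the following two notions
of averaged regret: the expected regret

$$\mathbb{E} R_n = \mathbb{E}\left[ \max_{i=1,\dots,K} \sum_{t=1}^n X_{i,t} - \sum_{t=1}^n X_{I_t,t} \right] \tag{1.2}$$

and the pseudo-regret

$$\overline{R}_n = \max_{i=1,\dots,K} \mathbb{E}\left[ \sum_{t=1}^n X_{i,t} - \sum_{t=1}^n X_{I_t,t} \right] . \tag{1.3}$$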



In both definitions, the expectation is taken with respect to the random 
draw of both rewards and forecaster's actions. Note that pseudo-regret 
is a weaker notion of regret, since one compares to the optimal action 
in expectation. The expected regret, instead, is the expectation of the 
regret with respect to the action which is optimal on the sequence of 
reward realizations. More formally, one has $\overline{R}_n \le \mathbb{E} R_n$.

In the original formalization of Robbins [1952], which builds on
the work of Wald [1947] (see also Arrow et al. [1949]), each arm
$i = 1, \dots, K$ corresponds to an unknown probability distribution $\nu_i$
on $[0, 1]$, and rewards $X_{i,t}$ are independent draws from the distribution
$\nu_i$ corresponding to the selected arm.



The stochastic bandit problem 

Known parameters: number of arms $K$ and (possibly) number of rounds $n \ge K$.
Unknown parameters: $K$ probability distributions $\nu_1, \dots, \nu_K$ on $[0, 1]$.

For each round $t = 1, 2, \dots$

(1) the forecaster chooses $I_t \in \{1, \dots, K\}$;

(2) given $I_t$, the environment draws the reward $X_{I_t,t} \sim \nu_{I_t}$ independently
from the past and reveals it to the forecaster.



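For $i = 1, \dots, K$ we denote by $\mu_i$ the mean of $\nu_i$ (mean reward of arm
$i$). Let

$$\mu^* = \max_{i=1,\dots,K} \mu_i \quad \text{and} \quad i^* \in \operatorname{argmax}_{i=1,\dots,K} \mu_i .$$

In the stochastic setting, it is easy to see that the pseudo-regret can be
written as

$$\overline{R}_n = n \mu^* - \sum_{t=1}^n \mathbb{E}\left[ \mu_{I_t} \right] . \tag{1.4}$$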

The analysis of the stochastic bandit model was pioneered in the seminal
paper of Lai and Robbins [1985], who introduced the technique
of upper confidence bounds for the asymptotic analysis of regret. In
Chapter 2 we describe this technique using the simpler formulation
of Agrawal [1995], which naturally lends itself to a finite-time analysis.

In parallel to the research on stochastic bandits, a game-theoretic 
formulation of the trade-off between exploration and exploitation has 
been independently investigated, although for quite some time this al- 
ternative formulation was not recognized as an instance of the multi- 
armed bandit problem. In order to motivate these game-theoretic ban- 
dits, consider again the initial example of gambling on slot machines. 
We now assume that we are in a rigged casino, where for each slot
machine $i = 1, \dots, K$ and time step $t \ge 1$ the owner sets the gain $X_{i,t}$
to some arbitrary (and possibly maliciously chosen) value $g_{i,t} \in [0, 1]$.
Note that it is not in the interest of the owner to simply set all the
gains to zero (otherwise, no gamblers would go to that casino). Now
recall that a forecaster selects sequentially one arm $I_t \in \{1, \dots, K\}$ at
each time step $t = 1, 2, \dots$ and observes (and earns) the gain $g_{I_t,t}$. Is
it still possible to minimize regret in such a setting?

Following a standard terminology, we call adversary, or opponent, 
the mechanism setting the sequence of gains for each arm. If this mecha- 
nism is independent of the forecaster's actions, then we call it an obliv- 
ious adversary. In general, however, the adversary may adapt to the 
forecaster's past behaviour, in which case we speak of a non-oblivious 
adversary. For instance, in the rigged casino the owner may observe 
the way a gambler plays in order to design even more evil sequences of 
gains. Clearly, the distinction between oblivious and non-oblivious ad- 
versary is only meaningful when the player is randomized (if the player 
is deterministic, then the adversary can pick a bad sequence of gains 
right at the beginning of the game by simulating the player's future 
actions). Note, however, that in the presence of a non-oblivious adversary
the interpretation of regret is ambiguous. Indeed, in this case the assignment
of gains $g_{i,t}$ to arms $i = 1, \dots, K$ made by the adversary
at each step $t$ is allowed to depend on the player's past randomized
actions $I_1, \dots, I_{t-1}$. In other words, $g_{i,t} = g_{i,t}(I_1, \dots, I_{t-1})$ for each $i$
and $t$. Now, the regret compares the player's cumulative gain to that
obtained by playing the single best arm for the first $n$ rounds. However,
had the player consistently chosen the same arm $i$ in each round,
namely $I_t = i$ for $t = 1, \dots, n$, the adversarial gains $g_{i,t}(I_1, \dots, I_{t-1})$
would possibly have been different from those actually experienced by
the player.

The study of non-oblivious regret is mainly motivated by the connection
between regret minimization and equilibria in games (see,
e.g., [Auer et al., 2002b, Section 9]). Here we just observe that game-theoretic
equilibria are indeed defined similarly to regret: in equilibrium,
the player has no incentive to behave differently provided the
opponent does not react to changes in the player's behaviour. Interestingly,
regret minimization has also been studied against reactive
opponents, see for instance the works of Pucci de Farias and Megiddo
[2006] and Arora et al. [2012a].

The adversarial bandit problem

Known parameters: number of arms $K \ge 2$ and (possibly) number of rounds
$n \ge K$.

For each round $t = 1, 2, \dots$

(1) the forecaster chooses $I_t \in \{1, \dots, K\}$, possibly with the help of
external randomization;

(2) simultaneously, the adversary selects a gain vector $g_t =
(g_{1,t}, \dots, g_{K,t}) \in [0, 1]^K$, possibly with the help of external randomization;

(3) the forecaster receives (and observes) the reward $g_{I_t,t}$, while the
gains of the other arms are not observed.
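
The protocol above, together with the earlier remark that a deterministic player can be defeated by an adversary who simulates it in advance, can be sketched as follows. This is an illustration under stated assumptions: the round-robin forecaster, the helper names `worst_case_gains` and `run`, and the values of `n` and `K` are all hypothetical choices made for the example.

```python
from typing import Callable, List, Tuple

History = List[Tuple[int, float]]  # (arm played, gain observed) for each past round


def round_robin(history: History, K: int) -> int:
    """A deterministic forecaster (hypothetical): cycle through the arms."""
    return len(history) % K


def worst_case_gains(forecaster: Callable[[History, int], int],
                     n: int, K: int) -> List[List[float]]:
    """Oblivious adversary: all gain vectors are fixed before the game starts.

    Because the forecaster is deterministic, its play can be simulated in
    advance; the arm it would pick gets gain 0 and every other arm gets gain 1.
    """
    history: History = []
    gains: List[List[float]] = []
    for _ in range(n):
        arm = forecaster(history, K)
        g = [1.0] * K
        g[arm] = 0.0
        gains.append(g)
        history.append((arm, 0.0))
    return gains


def run(forecaster: Callable[[History, int], int],
        gains: List[List[float]], K: int) -> History:
    """Play the game: only the gain of the chosen arm is revealed each round."""
    history: History = []
    for g in gains:
        arm = forecaster(history, K)
        history.append((arm, g[arm]))
    return history


n, K = 12, 3
gains = worst_case_gains(round_robin, n, K)
history = run(round_robin, gains, K)
earned = sum(gain for _, gain in history)
best_fixed = max(sum(g[i] for g in gains) for i in range(K))
regret = best_fixed - earned
print(earned, best_fixed, regret)
```

Against this gain sequence the deterministic player earns nothing, while every fixed arm earns $n(1 - 1/K)$, so the regret grows linearly in $n$. A forecaster drawing arms uniformly at random would earn $(K-1)/K$ per round in expectation on the same sequence, which is why randomization matters in this setting.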



In this adversarial setting the goal is to obtain regret bounds in high 
probability or in expectation with respect to any possible randomiza- 
tion in the strategies used by the forecaster or the opponent, and irre- 
spective of the opponent. In the case of a non-oblivious adversary this 
is not an easy task, and for this reason we usually start by bounding 
the pseudo-regret 



$$\overline{R}_n = \max_{i=1,\dots,K} \mathbb{E}\left[ \sum_{t=1}^n g_{i,t} - \sum_{t=1}^n g_{I_t,t} \right] .$$

Note that the randomization of the adversary is not very important 
here since we ask for bounds which hold for any opponent. On the 
other hand, it is fundamental to allow randomization for the forecaster
(see Chapter 3 for details and basic results in the adversarial bandit
model). This adversarial, or non-stochastic, version of the bandit
problem was originally proposed as a way of playing an unknown game
against an opponent. The problem of playing a game repeatedly, now
a classical topic in game theory, was initiated by the groundbreaking
work of James Hannan and David Blackwell. In Hannan's seminal paper
[Hannan, 1957], the game (i.e., the payoff matrix) is assumed to
be known by the player, who also observes the opponent's moves in
each play. Later, Baños [1968] considered the problem of a repeated
unknown game, where in each game round the player only observes
its own payoff. This problem turns out to be exactly equivalent to
the adversarial bandit problem with a non-oblivious adversary. Simpler
strategies for playing unknown games were more recently proposed
by Foster and Vohra [1998] and Hart and Mas-Colell [2000, 2001].
Approximately at the same time, the problem was rediscovered in computer
science by Auer et al. [2002b]. It was they who made apparent
the connection to stochastic bandits by coining the term nonstochastic
multi-armed bandit problem.

The third fundamental model of multi-armed bandits assumes that 
the reward processes are neither i.i.d. (like in stochastic bandits) nor 
adversarial. More precisely, arms are associated with K Markov pro- 
cesses, each with its own state space. Each time an arm i is chosen in 
state $s$, a stochastic reward is drawn from a probability distribution $\nu_{i,s}$,
and the state of the reward process for arm $i$ changes in a Markovian
fashion, based on an underlying stochastic transition matrix $M_i$. Both
reward and new state are revealed to the player. On the other hand, 
the state of arms that are not chosen remains unchanged. Going back 
to our initial interpretation of bandits as sequential resource allocation 
processes, here we may think of K competing projects that are sequen- 
tially allocated a unit resource of work. However, unlike the previous 
bandit models, in this case the state of a project that gets the resource 
may change. Moreover, the underlying stochastic transition matrices
$M_i$ are typically assumed to be known, thus the optimal policy can be
computed via dynamic programming and the problem is essentially of
computational nature. The seminal result of Gittins [1979] provides an
optimal greedy policy which can be computed efficiently.

A notable special case of Markovian bandits is that of Bayesian 
bandits. These are parametric stochastic bandits, where the parame- 
ters of the reward distributions are assumed to be drawn from known 
priors, and the regret is computed by also averaging over the draw 
of parameters from the prior. The Markovian state change associated 
with the selection of an arm corresponds here to updating the posterior 
distribution of rewards for that arm after observing a new reward. 

Markovian bandits are a standard model in the areas of Operations 
Research and Economics. However, the techniques used in their analysis 
are significantly different from those used to analyze stochastic and 
adversarial bandits. For this reason, in this survey we do not cover 
Markovian bandits and their many variants. 

2 Stochastic bandits: fundamental results



We start by recalling the basic definitions for the stochastic bandit 
problem. Each arm $i \in \{1, \dots, K\}$ corresponds to an unknown probability
distribution $\nu_i$. At each time step $t = 1, 2, \dots$, the forecaster
selects some arm $I_t \in \{1, \dots, K\}$ and receives a reward $X_{I_t,t}$ drawn
from $\nu_{I_t}$ (independently from the past). Denote by $\mu_i$ the mean of arm
$i$ and define

$$\mu^* = \max_{i=1,\dots,K} \mu_i \quad \text{and} \quad i^* \in \operatorname{argmax}_{i=1,\dots,K} \mu_i .$$

We focus here on the pseudo-regret, which is defined as

$$\overline{R}_n = n \mu^* - \mathbb{E} \sum_{t=1}^n \mu_{I_t} . \tag{2.1}$$

We choose the pseudo-regret as our main quantity of interest because 
in a stochastic framework it is more natural to compete against the 
optimal action in expectation, rather than the optimal action on the sequence
of realized rewards (as in the definition of the plain regret (1.1)).
Furthermore, because of the order of magnitude of typical random fluctuations,
in general one cannot hope to prove a bound on the expected
regret (1.2) better than $\Theta(\sqrt{n})$. On the contrary, the pseudo-regret
can be controlled so well that we are able to bound it by a logarithmic
function of $n$.

In the following we also use a different formula for the pseudo-regret. 
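Let $T_i(s) = \sum_{t=1}^s \mathbb{1}_{I_t = i}$ denote the number of times the player selected
arm $i$ on the first $s$ rounds. Let $\Delta_i = \mu^* - \mu_i$ be the suboptimality
parameter of arm $i$. Then the pseudo-regret can be written as:

$$\overline{R}_n = \left( \sum_{i=1}^K \mathbb{E} T_i(n) \right) \mu^* - \mathbb{E} \sum_{i=1}^K T_i(n) \mu_i = \sum_{i=1}^K \Delta_i \, \mathbb{E} T_i(n) .$$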
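
The per-arm form of the pseudo-regret rests on an identity that already holds for every fixed sequence of pulls, before taking expectations: summing $\mu^* - \mu_{I_t}$ over rounds equals summing the gap of each arm times its number of pulls. The sketch below checks this on made-up means and an arbitrary pull sequence (both are illustrative assumptions).

```python
import math

mu = [0.5, 0.7, 0.4]                 # hypothetical arm means mu_i
plays = [0, 1, 1, 2, 1, 0, 1]        # an arbitrary sequence of pulls (0-based arms)

n = len(plays)
mu_star = max(mu)
gaps = [mu_star - m for m in mu]                     # suboptimality gaps
counts = [plays.count(i) for i in range(len(mu))]    # number of pulls of each arm

per_round = n * mu_star - sum(mu[arm] for arm in plays)   # n mu* - sum_t mu_{I_t}
per_arm = sum(g * c for g, c in zip(gaps, counts))        # sum_i gap_i * pulls_i

print(per_round, per_arm)
print(math.isclose(per_round, per_arm, rel_tol=1e-9, abs_tol=1e-12))
```

The practical consequence is that bounding the pseudo-regret reduces to bounding the expected number of pulls of each suboptimal arm, which is how the UCB analysis proceeds.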



2.1 Optimism in face of uncertainty 

The difficulty of the stochastic multi-armed bandit problem lies in the
exploration-exploitation dilemma that the forecaster is facing. Indeed,
there is an intrinsic tradeoff between exploiting the current knowledge
to focus on the arm that seems to yield the highest rewards, and exploring
further the other arms to identify with better precision which
arm is actually the best. As we shall see, the key to obtaining a good strategy
for this problem is, in a certain sense, to simultaneously perform
exploration and exploitation.

A simple heuristic principle for doing that is the so-called optimism 
in face of uncertainty. The idea is very general, and applies to many se- 
quential decision making problems in uncertain environments. Assume 
that the forecaster has accumulated some data on the environment and 
must decide how to act next. First, a set of "plausible" environments 
which are "consistent" with the data (typically, through concentration 
inequalities) is constructed. Then, the most "favorable" environment is 
identified in this set. Based on that, the heuristic prescribes that the de- 
cision which is optimal in this most favorable and plausible environment 
should be made. As we see below, this principle gives simple and yet 
almost optimal algorithms for the stochastic multi-armed bandit prob- 
lem. More complex algorithms for various extensions of the stochastic 
multi-armed bandit problem are also based on the same idea which,
along with the exponential weighting scheme presented in Chapter 3, is
an algorithmic cornerstone of regret analysis in bandits.

2.2 Upper Confidence Bound (UCB) strategies

In this section we assume that the distribution of rewards $X$ satisfies
the following moment conditions. There exists a convex function $\psi$ on
the reals (see footnote 1) such that, for all $\lambda \ge 0$,

$$\ln \mathbb{E} e^{\lambda (X - \mathbb{E}[X])} \le \psi(\lambda) \quad \text{and} \quad \ln \mathbb{E} e^{\lambda (\mathbb{E}[X] - X)} \le \psi(\lambda) . \tag{2.2}$$

For example, when $X \in [0, 1]$ one can take $\psi(\lambda) = \frac{\lambda^2}{8}$. In this case (2.2)
is known as Hoeffding's lemma.

We attack the stochastic multi-armed bandit using the optimism in
face of uncertainty principle. In order to do so, we use assumption (2.2) to
construct an upper bound estimate on the mean of each arm at some
fixed confidence level, and then choose the arm that looks best under
this estimate. We need a standard notion from convex analysis: the
Legendre-Fenchel transform of $\psi$, defined by

$$\psi^*(\varepsilon) = \sup_{\lambda \in \mathbb{R}} \left( \lambda \varepsilon - \psi(\lambda) \right) .$$



For instance, if $\psi(x) = e^x$ then $\psi^*(x) = x \ln x - x$ for $x > 0$. If $\psi(x) =
\frac{1}{p} |x|^p$ then $\psi^*(x) = \frac{1}{q} |x|^q$ for any pair $1 < p, q < \infty$ such that $\frac{1}{p} + \frac{1}{q} = 1$
(see also Section 5.2, where the same notion is used in a different
bandit model).
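
The two closed-form transforms quoted above can be checked numerically by approximating the supremum on a grid. This is only a sanity-check sketch: the grid range and step, the helper name `legendre_fenchel`, and the choice $p = 3$, $q = 3/2$ are assumptions made for the example.

```python
import math

# Grid over lambda in [-10, 10] with step 0.001 (an arbitrary choice that
# contains the maximizers for the test points used below).
GRID = [k / 1000.0 for k in range(-10000, 10001)]


def legendre_fenchel(psi, x):
    """Grid approximation of psi*(x) = sup over lambda of (lambda * x - psi(lambda))."""
    return max(lam * x - psi(lam) for lam in GRID)


p, q = 3.0, 1.5  # conjugate exponents: 1/p + 1/q = 1


def power_psi(lam):
    return abs(lam) ** p / p


for x in (0.5, 1.0, 2.0):
    exp_numeric = legendre_fenchel(math.exp, x)
    exp_closed = x * math.log(x) - x
    pow_numeric = legendre_fenchel(power_psi, x)
    pow_closed = abs(x) ** q / q
    print(x, exp_numeric, exp_closed, pow_numeric, pow_closed)
```

The grid maximum can only underestimate the true supremum, and the gap shrinks quadratically with the grid step near a smooth maximizer, so a step of 0.001 is ample for this comparison.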

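Let $\hat{\mu}_{i,s}$ be the sample mean of rewards obtained by pulling arm
$i$ for $s$ times. Note that since the rewards are i.i.d., we have that in
distribution $\hat{\mu}_{i,s}$ is equal to $\frac{1}{s} \sum_{t=1}^s X_{i,t}$.

Using Markov's inequality, from (2.2) one obtains that

$$\mathbb{P}(\mu_i - \hat{\mu}_{i,s} > \varepsilon) \le e^{-s \psi^*(\varepsilon)} . \tag{2.3}$$

In other words, with probability at least $1 - \delta$,

$$\hat{\mu}_{i,s} + (\psi^*)^{-1}\left( \frac{1}{s} \ln \frac{1}{\delta} \right) > \mu_i .$$

We thus consider the following strategy, called $(\alpha, \psi)$-UCB, where $\alpha > 0$
is an input parameter: At time $t$, select

$$I_t \in \operatorname{argmax}_{i=1,\dots,K} \left[ \hat{\mu}_{i,T_i(t-1)} + (\psi^*)^{-1}\left( \frac{\alpha \ln t}{T_i(t-1)} \right) \right] .$$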
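
A minimal sketch of this strategy for $[0,1]$-valued rewards follows. With $\psi(\lambda) = \lambda^2/8$ one has $\psi^*(\varepsilon) = 2\varepsilon^2$, so $(\psi^*)^{-1}(x) = \sqrt{x/2}$ and the exploration bonus is $\sqrt{\alpha \ln t / (2 T_i(t-1))}$. Several details are assumptions of this sketch rather than part of the text: rewards are simulated as Bernoulli draws, each arm is pulled once at the start so that the index is well defined, and the function name `alpha_ucb`, the default `alpha=2.5`, and the seed are illustrative.

```python
import math
import random
from typing import List, Sequence


def alpha_ucb(means: Sequence[float], n: int, alpha: float = 2.5, seed: int = 0) -> List[int]:
    """Run alpha-UCB for n rounds on simulated Bernoulli arms; return pull counts.

    `means` are the (unknown to the forecaster) Bernoulli parameters, used
    only to simulate rewards.
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K     # number of pulls of each arm so far
    sums = [0.0] * K     # cumulative reward collected from each arm
    for t in range(1, n + 1):
        if t <= K:
            # Assumed initialization: pull each arm once so that counts are positive.
            arm = t - 1
        else:
            # Index = sample mean + sqrt(alpha * ln t / (2 * pulls)).
            arm = max(
                range(K),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(alpha * math.log(t) / (2.0 * counts[i])),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


pulls = alpha_ucb([0.2, 0.8], n=2000)
print(pulls)
```

The bonus shrinks as an arm is pulled more often and grows slowly with $t$, so a neglected arm is eventually tried again; this is the sense in which exploration and exploitation are performed simultaneously.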



We can prove the following simple bound. 



Footnote 1: One can easily generalize the discussion to functions $\psi$ defined only on an interval $[0, b)$.



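Theorem 2.1 (Pseudo-regret of $(\alpha, \psi)$-UCB). Assume that the
reward distributions satisfy (2.2). Then $(\alpha, \psi)$-UCB with $\alpha > 2$ satisfies

$$\overline{R}_n \le \sum_{i : \Delta_i > 0} \left( \frac{\alpha \Delta_i}{\psi^*(\Delta_i / 2)} \ln n + \frac{\alpha}{\alpha - 2} \right) .$$

In case of $[0, 1]$-valued random variables, taking $\psi(\lambda) = \frac{\lambda^2}{8}$ in
(2.2) (Hoeffding's lemma) gives $\psi^*(\varepsilon) = 2 \varepsilon^2$, which in turn gives the
following pseudo-regret bound

$$\overline{R}_n \le \sum_{i : \Delta_i > 0} \left( \frac{2 \alpha}{\Delta_i} \ln n + \frac{\alpha}{\alpha - 2} \right) . \tag{2.4}$$

In this important special case of bounded random variables we refer to
$(\alpha, \psi)$-UCB simply as $\alpha$-UCB.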
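
For readers who want the intermediate algebra, the specialization to bounded rewards can be worked out as follows. This is a short derivation sketch that uses only $\psi(\lambda) = \lambda^2/8$ and the definition of the Legendre-Fenchel transform; the supremum is attained at $\lambda = 4\varepsilon$.

```latex
\psi^*(\varepsilon)
  = \sup_{\lambda \in \mathbb{R}} \Big( \lambda \varepsilon - \frac{\lambda^2}{8} \Big)
  = 4\varepsilon \cdot \varepsilon - \frac{(4\varepsilon)^2}{8}
  = 2 \varepsilon^2 ,
\qquad
(\psi^*)^{-1}(x) = \sqrt{x / 2} ,
\qquad
\frac{\alpha \Delta_i}{\psi^*(\Delta_i / 2)}
  = \frac{\alpha \Delta_i}{\Delta_i^2 / 2}
  = \frac{2 \alpha}{\Delta_i} .
```

In particular, the index of $\alpha$-UCB for arm $i$ at time $t$ is the sample mean plus $\sqrt{\alpha \ln t / (2 T_i(t-1))}$.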

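Proof. First note that if $I_t = i$, then at least one of the three following
equations must be true:

$$\hat{\mu}_{i^*,T_{i^*}(t-1)} + (\psi^*)^{-1}\left( \frac{\alpha \ln t}{T_{i^*}(t-1)} \right) \le \mu^* \tag{2.5}$$

$$\hat{\mu}_{i,T_i(t-1)} > \mu_i + (\psi^*)^{-1}\left( \frac{\alpha \ln t}{T_i(t-1)} \right) \tag{2.6}$$

$$T_i(t-1) < \frac{\alpha \ln n}{\psi^*(\Delta_i / 2)} . \tag{2.7}$$

Indeed, assume that the three equations are all false, then we have:

$$\begin{aligned}
\hat{\mu}_{i^*,T_{i^*}(t-1)} + (\psi^*)^{-1}\left( \frac{\alpha \ln t}{T_{i^*}(t-1)} \right)
&> \mu^* \\
&= \mu_i + \Delta_i \\
&\ge \mu_i + 2 (\psi^*)^{-1}\left( \frac{\alpha \ln t}{T_i(t-1)} \right) \\
&\ge \hat{\mu}_{i,T_i(t-1)} + (\psi^*)^{-1}\left( \frac{\alpha \ln t}{T_i(t-1)} \right) ,
\end{aligned}$$

which implies, in particular, that $I_t \ne i$. In other words, letting

$$u = \left\lceil \frac{\alpha \ln n}{\psi^*(\Delta_i / 2)} \right\rceil ,$$

we just proved

$$\begin{aligned}
\mathbb{E} T_i(n) = \mathbb{E} \sum_{t=1}^n \mathbb{1}_{I_t = i}
&\le u + \mathbb{E} \sum_{t=u+1}^n \mathbb{1}_{I_t = i \text{ and (2.7) is false}} \\
&\le u + \mathbb{E} \sum_{t=u+1}^n \mathbb{1}_{\text{(2.5) or (2.6) is true}} \\
&\le u + \sum_{t=u+1}^n \Big( \mathbb{P}(\text{(2.5) is true}) + \mathbb{P}(\text{(2.6) is true}) \Big) .
\end{aligned}$$

Thus it suffices to bound the probability of the events (2.5) and (2.6).
Using a union bound and (2.3) one directly obtains:

$$\begin{aligned}
\mathbb{P}(\text{(2.5) is true})
&\le \mathbb{P}\left( \exists s \in \{1, \dots, t\} : \hat{\mu}_{i^*,s} + (\psi^*)^{-1}\left( \frac{\alpha \ln t}{s} \right) \le \mu^* \right) \\
&\le \sum_{s=1}^t \frac{1}{t^{\alpha}} = \frac{1}{t^{\alpha - 1}} .
\end{aligned}$$

The same upper bound holds for (2.6). Straightforward computations
conclude the proof. □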

2.3 Lower bound 

We now show that the result of the previous section is essentially 
unimprovable when the reward distributions are Bernoulli. For $p, q \in
[0, 1]$ we denote by $\mathrm{kl}(p, q)$ the Kullback-Leibler divergence between a
Bernoulli of parameter $p$ and a Bernoulli of parameter $q$, defined as

$$\mathrm{kl}(p, q) = p \ln \frac{p}{q} + (1 - p) \ln \frac{1 - p}{1 - q} .$$

Theorem 2.2 (Distribution-dependent lower bound). Consider
a strategy that satisfies $\mathbb{E} T_i(n) = o(n^a)$ for any set of Bernoulli reward
distributions, any arm $i$ with $\Delta_i > 0$, and any $a > 0$. Then, for any set
of Bernoulli reward distributions the following holds

$$\liminf_{n \to +\infty} \frac{\overline{R}_n}{\ln n} \ge \sum_{i : \Delta_i > 0} \frac{\Delta_i}{\mathrm{kl}(\mu_i, \mu^*)} .$$

In order to compare this result with (2.4) we use the following standard
inequalities (the left hand side follows from Pinsker's inequality, and
the right hand side simply uses $\ln x \le x - 1$),

$$2 (p - q)^2 \le \mathrm{kl}(p, q) \le \frac{(p - q)^2}{q (1 - q)} . \tag{2.8}$$
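
The Bernoulli divergence and the two-sided inequality (2.8) are easy to check numerically, which also makes the comparison between the lower bound and (2.4) concrete. The sketch below is illustrative: the helper names, the grid of parameters, and the example means and value of $\alpha$ are assumptions.

```python
import math


def kl(p: float, q: float) -> float:
    """Bernoulli KL divergence kl(p, q), using 0 ln 0 = 0; requires 0 < q < 1."""
    def term(a: float, b: float) -> float:
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)


def check_2_8(p: float, q: float, tol: float = 1e-12) -> bool:
    """True when 2(p-q)^2 <= kl(p,q) <= (p-q)^2 / (q(1-q)) holds numerically."""
    pinsker = 2.0 * (p - q) ** 2
    upper = (p - q) ** 2 / (q * (1.0 - q))
    return pinsker - tol <= kl(p, q) <= upper + tol


grid = [k / 20 for k in range(1, 20)]
print(all(check_2_8(p, q) for p in grid for q in grid))

# Constants multiplying ln n for one suboptimal arm (hypothetical means).
mu_i, mu_star, alpha = 0.4, 0.5, 2.5
gap = mu_star - mu_i
lower_constant = gap / kl(mu_i, mu_star)   # constant in the lower bound of Theorem 2.2
upper_constant = 2.0 * alpha / gap         # constant in the UCB bound (2.4)
print(lower_constant, upper_constant)
```

By the left inequality in (2.8), the lower-bound constant is at most $1/(2\Delta_i)$, whereas (2.4) has $2\alpha/\Delta_i$: the two agree in their dependence on $\ln n$ and differ only in the constant.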

Proof. The proof is organized in three steps. For simplicity, we only 
consider the case of two arms. 



First step: Notations. 

Without loss of generality assume that arm 1 is optimal and arm 2 is
suboptimal, that is $\mu_2 < \mu_1 < 1$. Let $\varepsilon > 0$. Since $x \mapsto \mathrm{kl}(\mu_2, x)$ is
continuous one can find $\mu_2' \in (\mu_1, 1)$ such that

$$\mathrm{kl}(\mu_2, \mu_2') \le (1 + \varepsilon) \, \mathrm{kl}(\mu_2, \mu_1) . \tag{2.9}$$

We use the notation $\mathbb{E}', \mathbb{P}'$ when we integrate with respect to the modified
bandit where the parameter of arm 2 is replaced by $\mu_2'$. We want
to compare the behavior of the forecaster on the initial and modified
bandits. In particular, we prove that with a big enough probability the
forecaster cannot distinguish between the two problems. Then, using
the fact that we have a good forecaster by hypothesis, we know that 
the algorithm does not make too many mistakes on the modified ban- 
dit where arm 2 is optimal. In other words, we have a lower bound on 
the number of times the optimal arm is played. This reasoning implies 
a lower bound on the number of times arm 2 is played in the initial 
problem. 

We now slightly change the notation for rewards and denote by
$X_{2,1}, \dots, X_{2,n}$ the sequence of random variables obtained when pulling
arm 2 for $n$ times (that is, $X_{2,s}$ is the reward obtained from the $s$-th
pull). For $s \in \{1, \dots, n\}$, let

$$\widehat{\mathrm{kl}}_s = \sum_{t=1}^s \ln \frac{\mu_2 X_{2,t} + (1 - \mu_2)(1 - X_{2,t})}{\mu_2' X_{2,t} + (1 - \mu_2')(1 - X_{2,t})} .$$

Note that, with respect to the initial bandit, $\widehat{\mathrm{kl}}_{T_2(n)}$ is the (non re-normalized)
empirical estimate of $\mathrm{kl}(\mu_2, \mu_2')$ at time $n$, since in that case
the process $(X_{2,s})$ is i.i.d. from a Bernoulli of parameter $\mu_2$. Another
important property is the following: for any event $A$ in the $\sigma$-algebra
generated by $X_{2,1}, \dots, X_{2,n}$ the following change-of-measure identity
holds:

$$\mathbb{P}'(A) = \mathbb{E}\left[ \mathbb{1}_A \exp\left( - \widehat{\mathrm{kl}}_{T_2(n)} \right) \right] . \tag{2.10}$$
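
The change-of-measure identity can be verified exactly in a small case by enumerating all reward sequences. The sketch below makes a simplifying assumption that is not in the text: arm 2 is pulled in every one of the $n$ rounds, so that $T_2(n) = n$ is deterministic. The parameter values and the event $A$ are arbitrary illustrative choices.

```python
import itertools
import math


def bern(p: float, x: int) -> float:
    """Probability that a Bernoulli(p) variable equals x, for x in {0, 1}."""
    return p if x == 1 else 1.0 - p


def kl_hat(xs, mu2: float, mu2_prime: float) -> float:
    """Log-likelihood ratio of the sequence xs: original parameter vs. modified one."""
    return sum(math.log(bern(mu2, x) / bern(mu2_prime, x)) for x in xs)


def event(xs) -> bool:
    """An arbitrary event A depending only on the rewards of arm 2."""
    return sum(xs) >= 2


mu2, mu2_prime, n = 0.3, 0.6, 4   # hypothetical parameters

lhs = 0.0  # P'(A): probability of A under the modified bandit
rhs = 0.0  # E[1_A exp(-kl_hat)]: expectation under the original bandit
for xs in itertools.product((0, 1), repeat=n):
    if event(xs):
        lhs += math.prod(bern(mu2_prime, x) for x in xs)
        rhs += math.prod(bern(mu2, x) for x in xs) * math.exp(-kl_hat(xs, mu2, mu2_prime))

print(lhs, rhs)
```

The two sums agree term by term, because multiplying the original likelihood of a sequence by the exponential of minus its log-likelihood ratio yields its likelihood under the modified parameter.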

In order to link the behavior of the forecaster on the initial and modified 
bandits we introduce the event 

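$$C_n = \left\{ T_2(n) < \frac{1 - \varepsilon}{\mathrm{kl}(\mu_2, \mu_2')} \ln(n) \ \text{ and } \ \widehat{\mathrm{kl}}_{T_2(n)} \le \left( 1 - \frac{\varepsilon}{2} \right) \ln(n) \right\} . \tag{2.11}$$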

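Second step: $\mathbb{P}(C_n) = o(1)$.
By (2.10) and (2.11) one has

$$\mathbb{P}'(C_n) = \mathbb{E}\, \mathbb{1}_{C_n} \exp\left( - \widehat{\mathrm{kl}}_{T_2(n)} \right) \ge e^{-(1 - \varepsilon/2) \ln(n)} \, \mathbb{P}(C_n) .$$

Introduce the shorthand

$$f_n = \frac{1 - \varepsilon}{\mathrm{kl}(\mu_2, \mu_2')} \ln(n) .$$

Using again (2.11) and Markov's inequality, the above implies

$$\mathbb{P}(C_n) \le n^{(1 - \varepsilon/2)} \, \mathbb{P}'(C_n) \le n^{(1 - \varepsilon/2)} \, \mathbb{P}'\left( T_2(n) < f_n \right) \le n^{(1 - \varepsilon/2)} \, \frac{\mathbb{E}'[n - T_2(n)]}{n - f_n} .$$

Now note that in the modified bandit arm 2 is the unique optimal arm.
Hence the assumption that for any bandit, any suboptimal arm $i$, and
any $a > 0$, the strategy satisfies $\mathbb{E} T_i(n) = o(n^a)$, implies that

$$\mathbb{P}(C_n) \le n^{(1 - \varepsilon/2)} \, \frac{\mathbb{E}'[n - T_2(n)]}{n - f_n} = o(1) .$$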

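Third step: $\mathbb{P}(T_2(n) < f_n) = o(1)$.
Observe that

$$\begin{aligned}
\mathbb{P}(C_n) &\ge \mathbb{P}\left( T_2(n) < f_n \ \text{ and } \ \max_{s \le f_n} \widehat{\mathrm{kl}}_s \le \left( 1 - \frac{\varepsilon}{2} \right) \ln(n) \right) \\
&= \mathbb{P}\left( T_2(n) < f_n \ \text{ and } \ \frac{\mathrm{kl}(\mu_2, \mu_2')}{(1 - \varepsilon) \ln(n)} \max_{s \le f_n} \widehat{\mathrm{kl}}_s \le \frac{1 - \varepsilon/2}{1 - \varepsilon} \, \mathrm{kl}(\mu_2, \mu_2') \right) .
\end{aligned} \tag{2.12}$$

Now we use the maximal version of the strong law of large numbers: for
any sequence $(X_t)$ of independent real random variables with positive
mean $\mu > 0$,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n X_t = \mu \ \text{ a.s.} \quad \text{implies} \quad \lim_{n \to \infty} \frac{1}{n} \max_{s=1,\dots,n} \sum_{t=1}^s X_t = \mu \ \text{ a.s.}$$

See, e.g., [Bubeck, 2010, Lemma 10.5].

Since $\mathrm{kl}(\mu_2, \mu_2') > 0$ and $\frac{1 - \varepsilon/2}{1 - \varepsilon} > 1$, we deduce that

$$\lim_{n \to \infty} \mathbb{P}\left( \frac{\mathrm{kl}(\mu_2, \mu_2')}{(1 - \varepsilon) \ln(n)} \max_{s \le f_n} \widehat{\mathrm{kl}}_s \le \frac{1 - \varepsilon/2}{1 - \varepsilon} \, \mathrm{kl}(\mu_2, \mu_2') \right) = 1 .$$

Thus, by the result of the second step and (2.12), we get

$$\mathbb{P}\left( T_2(n) < f_n \right) = o(1) .$$

Now recalling that $f_n = \frac{1 - \varepsilon}{\mathrm{kl}(\mu_2, \mu_2')} \ln(n)$, and using (2.9), we obtain

$$\mathbb{E} T_2(n) \ge (1 + o(1)) \, \frac{1 - \varepsilon}{1 + \varepsilon} \, \frac{\ln(n)}{\mathrm{kl}(\mu_2, \mu_1)} ,$$

which concludes the proof. □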

2.4 Refinements and bibliographic remarks

The UCB strategy presented in Section 2.2 was introduced by
Auer et al. [2002a] for bounded random variables. Theorem 2.2 is extracted
from Lai and Robbins [1985]. Note that in this last paper the
result is more general than ours, which is restricted to Bernoulli distributions.
Although Burnetas and Katehakis [1997] prove an even more
general lower bound, Theorem 2.2 and the UCB regret bound provide
a reasonably complete solution to the problem. We now discuss some
of the possible refinements. In the following, we restrict our attention
to the case of bounded rewards (except in Section 2.4.7).



2.4.1 Improved constants 

The regret bound proof for UCB can be improved in two ways. First,
the union bound over the different time steps can be replaced by a
"peeling" argument. This allows one to show a logarithmic regret for any
$\alpha > 1$, whereas the proof of Section 2.2 requires $\alpha > 2$ (see [Bubeck,
2010, Section 2.2] for more details). A second improvement, proposed by
Garivier and Cappé [2011], is to use a more subtle set of conditions than
(2.5)-(2.7). In fact, the authors take both improvements into account,
and show that $\alpha$-UCB has a regret of order $\frac{\alpha}{2} \ln n$ for any $\alpha > 1$. In
the limit when $\alpha$ tends to 1, this constant is unimprovable in light of
Theorem 2.2 and (2.8).



2.4.2 Second order bounds 

Although $\alpha$-UCB is essentially optimal, the gap between (2.4) and Theorem 2.2
can be important if $\mathrm{kl}(\mu_i, \mu^*)$ is significantly larger than $\Delta_i^2$.
Several improvements have been proposed towards closing this gap. In
particular, the UCB-V algorithm of Audibert et al. [2009] takes into
account the variance of the distributions and replaces Hoeffding's inequality
by Bernstein's inequality in the derivation of UCB. A previous
algorithm with similar ideas was developed by Auer et al. [2002a]
without theoretical guarantees. A second type of approach replaces L2-neighborhoods
in $\alpha$-UCB by kl-neighborhoods. This line of work started
with Honda and Takemura [2010] where only asymptotic guarantees
were provided. Later, Garivier and Cappé [2011] and Maillard et al.
[2011] (see also Cappé et al. [2012]) independently proposed a similar
algorithm, called KL-UCB, which is shown to attain the optimal rate
in finite-time. More precisely, Garivier and Cappé [2011] showed that
KL-UCB attains a regret smaller than

$$\sum_{i : \Delta_i > 0} \frac{\Delta_i}{\mathrm{kl}(\mu_i, \mu^*)} \, \alpha \ln n + O(1)$$

where $\alpha > 1$ is a parameter of the algorithm. Thus, KL-UCB is optimal
for Bernoulli distributions, and strictly dominates $\alpha$-UCB for any
bounded reward distributions.
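
To illustrate what replacing L2-neighborhoods by kl-neighborhoods means, the sketch below computes an upper confidence index as the largest $q$ whose Bernoulli divergence from the sample mean stays within a budget, found by bisection. The text does not spell out the KL-UCB index, so the budget $\alpha \ln t / T_i(t-1)$, the function name `kl_ucb_index`, and the example values are assumptions of this sketch, chosen to mirror the $\alpha$-UCB parameterization rather than to reproduce the cited algorithm exactly.

```python
import math


def kl(p: float, q: float) -> float:
    """Bernoulli KL divergence, with q clamped away from 0 and 1 for numerical safety."""
    eps = 1e-12
    q = min(max(q, eps), 1.0 - eps)

    def term(a: float, b: float) -> float:
        return 0.0 if a <= 0.0 else a * math.log(a / b)

    return term(p, q) + term(1.0 - p, 1.0 - q)


def kl_ucb_index(mean: float, pulls: int, t: int, alpha: float = 1.5, iters: int = 50) -> float:
    """Largest q in [mean, 1] with kl(mean, q) <= alpha * ln(t) / pulls (assumed budget).

    kl(mean, q) is nondecreasing in q on [mean, 1], so bisection applies.
    """
    budget = alpha * math.log(t) / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


mean, pulls, t, alpha = 0.1, 50, 1000, 1.5
kl_index = kl_ucb_index(mean, pulls, t, alpha)
hoeffding_index = mean + math.sqrt(alpha * math.log(t) / (2.0 * pulls))
print(kl_index, hoeffding_index)
```

By Pinsker's inequality the kl-neighborhood is contained in the corresponding L2-neighborhood, so for the same budget the kl-based index is never larger than the Hoeffding-based one, and it is noticeably tighter when the sample mean is close to 0 or 1.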



2.4.3 Distribution-free bounds 

In the limit when $\Delta_i$ tends to 0, the upper bound in (2.4) becomes
vacuous. On the other hand, it is clear that the regret incurred from
pulling arm $i$ cannot be larger than $n \Delta_i$. Using this idea, it is easy to
show that the regret of $\alpha$-UCB is always smaller than $\sqrt{\alpha n K \ln n}$ (up
to a numerical constant). However, as we shall see in the next chapter,
one can show a minimax lower bound on the regret of order $\sqrt{nK}$.
Audibert and Bubeck [2009] proposed a modification of $\alpha$-UCB that
gets rid of the extraneous logarithmic term in the upper bound. More
precisely, let $\Delta = \min_{i \ne i^*} \Delta_i$. Audibert and Bubeck [2010] show that
MOSS (Minimax Optimal Strategy in the Stochastic case) attains a
regret smaller than

$$\min\left\{ \sqrt{nK}, \ \frac{K}{\Delta} \ln \frac{n \Delta^2}{K} \right\}$$

up to a numerical constant. The weakness of this result is that the
second term in the above equation only depends on the smallest gap
$\Delta$. In Auer and Ortner [2010] (see also Perchet and Rigollet [2011])
the authors designed a strategy, called improved UCB, with a regret of
order

$$\sum_{i : \Delta_i > 0} \frac{1}{\Delta_i} \ln\left( n \Delta_i^2 \right) .$$

This latter regret bound can be better than the one for MOSS in some
regimes, but it does not attain the minimax optimal rate of order $\sqrt{nK}$.
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It is an open problem to obtain a strategy with a regret always better 
than those of MOSS and improved UCB. A plausible conjecture is that 
a regret of order 

1 

a? 



V — In — with H = V 

i:Ai>0 i:A;>0 

is attainable. Note that the quantity H app ears in other var i ants o f the 



stochastic multi-armed bandit problem, see lAudibert et"afl |20ld ]. 



2.4.4 High probability bounds 

While bounds on the pseudo-regret $\overline{R}_n$ are important, one
typically wants to control the quantity
$\hat{R}_n = n\mu^* - \sum_{t=1}^n \mu_{I_t}$ with high probability.
Showing that $\hat{R}_n$ is close to its expectation $\overline{R}_n$ is a
challenging task, since naively one might expect the fluctuations of
$\hat{R}_n$ to be of order $\sqrt{n}$, which would dominate the
expectation $\overline{R}_n$, which is only of order $\ln n$. The
concentration properties of $\hat{R}_n$ for UCB are analyzed in detail in
Audibert et al. [2009], where it is shown that $\hat{R}_n$ concentrates
around its expectation, but that there is also a polynomial (in $n$)
probability that $\hat{R}_n$ is of order $n$. In fact the polynomial
concentration of $\hat{R}_n$ around $\overline{R}_n$ can be directly
derived from our proof of Theorem 2.1. In Salomon and Audibert [2011] it
is proved that for anytime strategies (i.e., strategies that do not use
the time horizon $n$) it is basically impossible to improve this
polynomial concentration to a classical exponential concentration. In
particular this shows that, as far as high probability bounds are
concerned, anytime strategies are surprisingly weaker than strategies
using the time horizon information (for which exponential concentration
of $\hat{R}_n$ around $\ln n$ is possible, see Audibert et al. [2009]).



2.4.5 $\varepsilon$-greedy

A simple and popular heuristic for bandit problems is the
$\varepsilon$-greedy strategy, see, e.g., Sutton and Barto [1998]. The
idea is very simple. First, pick a parameter $0 < \varepsilon < 1$. Then,
at each step greedily play the arm with highest empirical mean reward
with probability $1 - \varepsilon$, and play a random arm with
probability $\varepsilon$. Auer et al. [2002a] proved that, if
$\varepsilon$ is allowed to be a certain function $\varepsilon_t$ of the
current time step $t$, namely $\varepsilon_t = K/(d^2 t)$, then the regret
grows logarithmically like $(K \ln n)/d^2$, provided
$0 < d < \min_{i \neq i^*} \Delta_i$. While this bound has a suboptimal
dependence on $d$, Auer et al. [2002a] show that this algorithm performs
well in practice, but the performance degrades quickly if $d$ is not
chosen as a tight lower bound of $\min_{i \neq i^*} \Delta_i$.
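
The time-varying schedule described above can be sketched as a short simulation on Bernoulli arms. This is a minimal illustration under assumptions: the schedule is capped at 1 (so that it is a valid probability for small $t$), unplayed arms are explored first to avoid dividing by zero, and the function name and the `means` argument (true parameters used only to simulate rewards) are hypothetical.

```python
import random

def epsilon_greedy(means, n, d, seed=0):
    """Run eps_t-greedy with eps_t = min(1, K / (d^2 t)) on Bernoulli arms.

    Returns the pseudo-regret sum_t (mu* - mu_{I_t}) of the simulated run.
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K      # number of pulls of each arm
    sums = [0.0] * K      # cumulative reward of each arm
    best = max(means)
    regret = 0.0
    for t in range(1, n + 1):
        eps_t = min(1.0, K / (d * d * t))
        if rng.random() < eps_t or 0 in counts:
            arm = rng.randrange(K)  # explore uniformly at random
        else:
            arm = max(range(K), key=lambda i: sums[i] / counts[i])  # exploit
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret
```

Running the sketch with a value of `d` far below the true gap makes the exploration probability decay slowly, which illustrates the sensitivity to $d$ mentioned above.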



2.4.6 Thompson sampling 



In the very first paper on the multi-armed bandit problem, Thompson
[1933], a simple strategy was proposed for the case of Bernoulli
distributions. The so-called Thompson sampling algorithm proceeds as
follows. Assume a uniform prior on the parameters $\mu_i \in [0, 1]$, let
$\pi_{i,t}$ be the posterior distribution for $\mu_i$ at the $t$-th round,
and let $\theta_{i,t} \sim \pi_{i,t}$ (independently from the past given
$\pi_{i,t}$). The strategy is then given by
$I_t \in \operatorname{argmax}_{i=1,\dots,K} \theta_{i,t}$. Recently there
has been a surge of interest for this simple policy, mainly because of
its flexibility to incorporate prior knowledge on the arms, see for
example Chapelle and Li [2011] and the references therein. While the
theoretical behavior of Thompson sampling has remained elusive for a
long time, we have now a fairly good understanding of its theoretical
properties: in Agrawal and Goyal [2012] the first logarithmic regret
bound was proved, and in Kaufmann et al. [2012b] it was shown that in
fact Thompson sampling attains essentially the same regret as in (2.4).
Interestingly, note that while Thompson sampling comes from a Bayesian
reasoning, it is analyzed with a frequentist perspective. For more on
the interplay between Bayesian strategy and frequentist regret analysis
we refer the reader to Kaufmann et al. [2012a].
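
For Bernoulli rewards and a uniform prior, the posterior of each $\mu_i$ is a Beta distribution, so the procedure above can be sketched in a few lines. The function name and the `means` argument (true parameters, used only to simulate rewards) are illustrative assumptions.

```python
import random

def thompson_sampling(means, n, seed=0):
    """Thompson sampling on Bernoulli arms; returns the number of pulls per arm."""
    rng = random.Random(seed)
    K = len(means)
    successes = [0] * K
    failures = [0] * K
    pulls = [0] * K
    for _ in range(n):
        # Uniform prior = Beta(1, 1); after s successes and f failures the
        # posterior of mu_i is Beta(1 + s, 1 + f).
        theta = [rng.betavariate(1 + successes[i], 1 + failures[i])
                 for i in range(K)]
        arm = max(range(K), key=lambda i: theta[i])  # play the argmax sample
        if rng.random() < means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
    return pulls
```

Prior knowledge on the arms can be injected by replacing the `Beta(1, 1)` starting counts with other pseudo-counts, which is the flexibility alluded to above.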



2.4.7 Heavy-tailed distributions 

We showed in Section 2.2 how to obtain a UCB-type strategy through
a bound on the moment generating function. Moreover one can see
that the resulting bound in Theorem 2.1 deteriorates as the tails of the
distributions become heavier. In particular, we did not provide any
result for the case of distributions for which the moment generating
function is not finite. Surprisingly, it was shown in Bubeck et al.
[2012b] that in fact there exists a strategy with essentially the same
regret as in (2.4), as soon as the variances of the distributions are
finite. More precisely, using more refined robust estimators of the mean
than the basic empirical mean, one can construct a UCB-type strategy
such that for distributions with moment of order $1 + \varepsilon$
bounded by 1 it satisfies



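$$\overline{R}_n \le \sum_{i : \Delta_i > 0} \left( 8 \left( \frac{4}{\Delta_i} \right)^{1/\varepsilon} \ln n + 5\Delta_i \right).$$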




We refer the interested reader to Bubeck et al. [2012b] for more details
on these 'robust' versions of UCB.
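
As an illustration of what a robust mean estimator can look like, the sketch below implements the median-of-means estimator, a standard choice for heavy-tailed data. It is offered as an example of the general idea; which estimator and which tuning a given robust UCB variant uses is not specified here, and the function name is hypothetical.

```python
import statistics

def median_of_means(samples, num_blocks):
    """Split samples into num_blocks consecutive blocks of equal size,
    average each block, and return the median of the block averages.
    Samples left over after the last full block are ignored."""
    if not 1 <= num_blocks <= len(samples):
        raise ValueError("num_blocks must be between 1 and len(samples)")
    block_size = len(samples) // num_blocks
    block_means = [
        sum(samples[b * block_size:(b + 1) * block_size]) / block_size
        for b in range(num_blocks)
    ]
    return statistics.median(block_means)
```

A single extreme observation can move the empirical mean arbitrarily far, but it can corrupt only one block average, so the median of the block averages stays close to the bulk of the data.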



3 Adversarial bandits: fundamental results



In this chapter we consider the important variant of the multi-armed
bandit problem where no stochastic assumption is made on the
generation of rewards. Denote by $g_{i,t}$ the reward (or gain) of arm $i$
at time step $t$. We assume all rewards are bounded, say
$g_{i,t} \in [0,1]$. At each time step $t = 1, 2, \dots$, simultaneously
with the player's choice of the arm $I_t \in \{1, \dots, K\}$, an adversary
assigns to each arm $i = 1, \dots, K$ the reward $g_{i,t}$. Similarly to
the stochastic setting, we measure the performance of the player
compared to the performance of the best arm through the regret

$$R_n = \max_{i=1,\dots,K} \sum_{t=1}^n g_{i,t} - \sum_{t=1}^n g_{I_t,t}.$$

Sometimes we consider losses rather than gains. In this case we denote
by $\ell_{i,t}$ the loss of arm $i$ at time step $t$, and the regret
rewrites as

$$R_n = \sum_{t=1}^n \ell_{I_t,t} - \min_{i=1,\dots,K} \sum_{t=1}^n \ell_{i,t}.$$

The loss and gain versions are symmetric, in the sense that one can
translate the analysis from one to the other setting via the equivalence
$\ell_{i,t} = 1 - g_{i,t}$. In the following we emphasize the loss version,
but we revert to the gain version whenever it makes proofs simpler.

The main goal is to achieve sublinear (in the number of rounds)
bounds on the regret uniformly over all possible adversarial assignments
of gains to arms. At first sight, this goal might seem hopeless. Indeed,
for any deterministic forecaster there exists a sequence of losses
$(\ell_{i,t})$ such that $R_n \ge n/2$. Concretely, it suffices to consider
the following sequence of losses:

if $I_t = 1$, then $\ell_{2,t} = 0$ and $\ell_{i,t} = 1$ for all $i \neq 2$;
if $I_t \neq 1$, then $\ell_{1,t} = 0$ and $\ell_{i,t} = 1$ for all $i \neq 1$.
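
The construction above can be checked by simulation. In the sketch below, `forecaster(history)` is a hypothetical interface for a deterministic forecaster: it maps the list of past `(arm, loss)` pairs to the next arm in $\{1, \dots, K\}$. Because the forecaster is deterministic, the adversary can compute its next choice before assigning the losses.

```python
def adversary_vs_deterministic(forecaster, n, K=2):
    """Play the loss sequence from the text against a deterministic forecaster.

    Returns the regret R_n = (player's cumulative loss) - (best arm's loss).
    """
    history = []
    cumulative = [0.0] * (K + 1)  # cumulative[i] = total loss of arm i (index 0 unused)
    player_loss = 0.0
    for _ in range(n):
        arm = forecaster(history)            # the adversary predicts this choice
        zero_arm = 2 if arm == 1 else 1      # the only arm with loss 0 this round
        losses = [None] + [0.0 if i == zero_arm else 1.0 for i in range(1, K + 1)]
        player_loss += losses[arm]           # always 1: the played arm is never zero_arm
        for i in range(1, K + 1):
            cumulative[i] += losses[i]
        history.append((arm, losses[arm]))
    return player_loss - min(cumulative[1:])
```

The player suffers loss 1 in every round, while in each round exactly one of arms 1 and 2 has loss 0, so the better of these two arms has cumulative loss at most $n/2$. This is why the regret is at least $n/2$ whatever the deterministic rule is.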

The key idea to get around this difficulty is to add randomization to
the selection of the action $I_t$ to play. By doing so, the forecaster can
"surprise" the adversary, and this surprise effect suffices to get a regret
essentially as low as the minimax regret for the stochastic model. Since
the regret $R_n$ then becomes a random variable, the goal is thus to
obtain bounds in high probability or in expectation on $R_n$ (with respect
to the possible randomization of both the forecaster and the adversary).
This task is fairly difficult, and a convenient first step is to bound the
pseudo-regret

$$\overline{R}_n = \mathbb{E} \sum_{t=1}^n \ell_{I_t,t} - \min_{i=1,\dots,K} \mathbb{E} \sum_{t=1}^n \ell_{i,t}.$$

Clearly $\overline{R}_n \le \mathbb{E} R_n$, and thus an upper bound on the
pseudo-regret does not imply a bound on the expected regret. As argued
in the Introduction, the pseudo-regret has no natural interpretation
unless the adversary is oblivious. In that case, the pseudo-regret
coincides with the standard regret, which is always the ultimate
quantity of interest.

3.1 Pseudo-regret bounds

As we pointed out, in order to obtain non-trivial regret guarantees 
in the adversarial framework it is necessary to consider randomized 
forecasters. Below we describe the randomized forecaster Exp3, which 
is based on two fundamental ideas. 






Exp3 (Exponential weights for Exploration and Exploitation)
Parameter: a non-increasing sequence of real numbers $(\eta_t)_{t \in \mathbb{N}}$.
Let $p_1$ be the uniform distribution over $\{1, \dots, K\}$.
For each round $t = 1, 2, \dots, n$

(1) Draw an arm $I_t$ from the probability distribution $p_t$.

(2) For each arm $i = 1, \dots, K$ compute the estimated loss
$\tilde{\ell}_{i,t} = \frac{\ell_{i,t}}{p_{i,t}} \mathbb{1}_{I_t = i}$ and
update the estimated cumulative loss
$\tilde{L}_{i,t} = \tilde{L}_{i,t-1} + \tilde{\ell}_{i,t}$.

(3) Compute the new probability distribution over arms
$p_{t+1} = (p_{1,t+1}, \dots, p_{K,t+1})$, where

$$p_{i,t+1} = \frac{\exp\left(-\eta_t \tilde{L}_{i,t}\right)}{\sum_{k=1}^K \exp\left(-\eta_t \tilde{L}_{k,t}\right)}.$$



First, despite the fact that only the loss of the played arm is observed,
with a simple trick it is still possible to build an unbiased estimator for
the loss of any other arm. Namely, if the next arm $I_t$ to be played is
drawn from a probability distribution $p_t = (p_{1,t}, \dots, p_{K,t})$, then

$$\tilde{\ell}_{i,t} = \frac{\ell_{i,t}}{p_{i,t}} \mathbb{1}_{I_t = i}$$

is an unbiased estimator (with respect to the draw of $I_t$) of
$\ell_{i,t}$. Indeed, for each $i = 1, \dots, K$ we have

$$\mathbb{E}_{I_t \sim p_t}\left[\tilde{\ell}_{i,t}\right] = \sum_{j=1}^K p_{j,t} \frac{\ell_{i,t}}{p_{i,t}} \mathbb{1}_{j = i} = \ell_{i,t}.$$
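
The unbiasedness computation can be verified numerically by enumerating the possible draws of $I_t$ instead of sampling them. The function name below is hypothetical; arms are indexed from 0 for convenience.

```python
def expected_loss_estimate(losses, probs, i):
    """Exact expectation, over I_t ~ probs, of the importance-weighted
    estimate (losses[i] / probs[i]) * 1{I_t = i} of arm i's loss."""
    total = 0.0
    for drawn_arm, p in enumerate(probs):
        estimate = losses[i] / probs[i] if drawn_arm == i else 0.0
        total += p * estimate
    return total
```

Only the term with `drawn_arm == i` contributes, and its weight `probs[i]` cancels the division by `probs[i]`, which is exactly the one-line argument in the display above.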

The second idea is to use an exponential reweighting of the cumulative
estimated losses to define the probability distribution $p_t$ from which
the forecaster will select the arm $I_t$. Exponential weighting schemes
are a standard tool in the study of sequential prediction schemes under
adversarial assumptions. The reader is referred to the monograph by
Cesa-Bianchi and Lugosi [2006] for a general introduction to prediction
of individual sequences, and to the recent survey by Arora et al. [2012b]
focused on computer science applications of exponential weighting.
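
The two ideas combine into a short program. The sketch below is a minimal anytime variant under stated assumptions: the learning rate is taken as $\eta_t = \sqrt{\ln K / (tK)}$, the callback `loss_fn(t, arm)` (a hypothetical interface) returns the loss in $[0,1]$ of the played arm only, arms are indexed from 0, and the minimum estimated loss is subtracted before exponentiating purely for numerical stability (this does not change the distribution).

```python
import math
import random

def exp3(loss_fn, K, n, seed=0):
    """Anytime Exp3 sketch; returns (cumulative loss, final distribution)."""
    rng = random.Random(seed)
    cum_est = [0.0] * K        # estimated cumulative losses
    probs = [1.0 / K] * K      # p_1 is the uniform distribution
    total_loss = 0.0
    for t in range(1, n + 1):
        arm = rng.choices(range(K), weights=probs)[0]
        loss = loss_fn(t, arm)             # bandit feedback: one loss observed
        total_loss += loss
        cum_est[arm] += loss / probs[arm]  # importance-weighted estimate
        eta = math.sqrt(math.log(K) / (t * K))
        shift = min(cum_est)               # stabilizes the exponentials
        weights = [math.exp(-eta * (L - shift)) for L in cum_est]
        norm = sum(weights)
        probs = [w / norm for w in weights]
    return total_loss, probs
```

Note that the estimates of unplayed arms are left unchanged in a round, since their importance-weighted estimate is zero.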

We provide two different pseudo-regret bounds for this strategy. The
bound (3.3) is obtained assuming that the forecaster does not know the
number of rounds $n$. This is the anytime version of the algorithm. The
bound (3.2), instead, shows that a better constant can be achieved
using the knowledge of the time horizon.

Theorem 3.1 (Pseudo-regret of Exp3). If Exp3 is run with
$\eta_t = \eta = \sqrt{\frac{2 \ln K}{nK}}$, then

$$\overline{R}_n \le \sqrt{2 n K \ln K}. \qquad (3.2)$$

Moreover, if Exp3 is run with $\eta_t = \sqrt{\frac{\ln K}{tK}}$, then

$$\overline{R}_n \le 2 \sqrt{n K \ln K}. \qquad (3.3)$$



Proof. We prove that for any non-increasing sequence
$(\eta_t)_{t \in \mathbb{N}}$ Exp3 satisfies

$$\overline{R}_n \le \frac{K}{2} \sum_{t=1}^n \eta_t + \frac{\ln K}{\eta_n}. \qquad (3.4)$$

Inequality (3.2) then trivially follows from (3.4). For (3.3) we use (3.4)
and $\sum_{t=1}^n \frac{1}{\sqrt{t}} \le \int_0^n \frac{1}{\sqrt{t}}\, dt = 2\sqrt{n}$.
The proof of (3.4) is divided in five steps.
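
For completeness, the step from (3.4) to (3.2) can be spelled out as follows. This is a routine calculation added for the reader, with $\eta_t = \eta$ constant:

```latex
\[
\frac{K}{2}\sum_{t=1}^{n}\eta + \frac{\ln K}{\eta}
  = \frac{nK\eta}{2} + \frac{\ln K}{\eta},
\qquad
\eta = \sqrt{\frac{2\ln K}{nK}}
\;\Longrightarrow\;
\frac{nK\eta}{2} = \frac{\ln K}{\eta} = \sqrt{\frac{nK\ln K}{2}},
\]
\[
\text{so that}\quad
\overline{R}_n \le 2\sqrt{\frac{nK\ln K}{2}} = \sqrt{2nK\ln K}.
\]
```

The chosen $\eta$ is the minimizer of $\eta \mapsto \frac{nK\eta}{2} + \frac{\ln K}{\eta}$, which is why it balances the two terms exactly.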

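First step: Useful equalities.

The following equalities can be easily verified:

$$\mathbb{E}_{i \sim p_t} \tilde{\ell}_{i,t} = \ell_{I_t,t}, \quad
\mathbb{E}_{I_t \sim p_t} \tilde{\ell}_{i,t} = \ell_{i,t}, \quad
\mathbb{E}_{i \sim p_t} \tilde{\ell}_{i,t}^2 = \frac{\ell_{I_t,t}^2}{p_{I_t,t}}, \quad
\mathbb{E}_{I_t \sim p_t} \frac{1}{p_{I_t,t}} = K. \qquad (3.5)$$

In particular, they imply

$$\sum_{t=1}^n \ell_{I_t,t} - \sum_{t=1}^n \ell_{k,t}
= \sum_{t=1}^n \mathbb{E}_{i \sim p_t} \tilde{\ell}_{i,t}
- \sum_{t=1}^n \mathbb{E}_{I_t \sim p_t} \tilde{\ell}_{k,t}. \qquad (3.6)$$

The key idea of the proof is to rewrite
$\mathbb{E}_{i \sim p_t} \tilde{\ell}_{i,t}$ as follows:

$$\mathbb{E}_{i \sim p_t} \tilde{\ell}_{i,t}
= \frac{1}{\eta_t} \ln \mathbb{E}_{i \sim p_t} \exp\left(-\eta_t\left(\tilde{\ell}_{i,t} - \mathbb{E}_{k \sim p_t} \tilde{\ell}_{k,t}\right)\right)
- \frac{1}{\eta_t} \ln \mathbb{E}_{i \sim p_t} \exp\left(-\eta_t \tilde{\ell}_{i,t}\right). \qquad (3.7)$$

The reader may recognize
$\ln \mathbb{E}_{i \sim p_t} \exp\left(-\eta_t \tilde{\ell}_{i,t}\right)$ as the
cumulant-generating function (or the log of the moment-generating
function) of the random variable $\tilde{\ell}_{i,t}$. This quantity
naturally arises in the analysis of forecasters based on exponential
weights. In the next two steps we study the two terms in the right-hand
side of (3.7).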



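Second step: Study of the first term in (3.7).

We use the inequalities $\ln x \le x - 1$ and
$\exp(-x) - 1 + x \le x^2/2$, for all $x \ge 0$, to obtain:

$$\begin{aligned}
\ln \mathbb{E}_{i \sim p_t} \exp\left(-\eta_t\left(\tilde{\ell}_{i,t} - \mathbb{E}_{k \sim p_t} \tilde{\ell}_{k,t}\right)\right)
&= \ln \mathbb{E}_{i \sim p_t} \exp\left(-\eta_t \tilde{\ell}_{i,t}\right) + \eta_t \mathbb{E}_{k \sim p_t} \tilde{\ell}_{k,t} \\
&\le \mathbb{E}_{i \sim p_t} \left( \exp\left(-\eta_t \tilde{\ell}_{i,t}\right) - 1 + \eta_t \tilde{\ell}_{i,t} \right) \\
&\le \mathbb{E}_{i \sim p_t} \frac{\eta_t^2 \tilde{\ell}_{i,t}^2}{2} \\
&\le \frac{\eta_t^2}{2 p_{I_t,t}} \qquad (3.8)
\end{aligned}$$

where the last step comes from the third equality in (3.5).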
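Third step: Study of the second term in (3.7).

Let $\tilde{L}_{i,0} = 0$, $\Phi_0(\eta) = 0$ and
$\Phi_t(\eta) = \frac{1}{\eta} \ln \frac{1}{K} \sum_{i=1}^K \exp\left(-\eta \tilde{L}_{i,t}\right)$.
Then, by definition of $p_t$ we have

$$-\frac{1}{\eta_t} \ln \mathbb{E}_{i \sim p_t} \exp\left(-\eta_t \tilde{\ell}_{i,t}\right)
= -\frac{1}{\eta_t} \ln \frac{\sum_{i=1}^K \exp\left(-\eta_t \tilde{L}_{i,t}\right)}{\sum_{i=1}^K \exp\left(-\eta_t \tilde{L}_{i,t-1}\right)}
= \Phi_{t-1}(\eta_t) - \Phi_t(\eta_t). \qquad (3.9)$$

Fourth step: Summing.

Putting together (3.6), (3.7), (3.8) and (3.9) we obtain

$$\sum_{t=1}^n \ell_{I_t,t} - \sum_{t=1}^n \ell_{k,t}
\le \sum_{t=1}^n \frac{\eta_t}{2 p_{I_t,t}}
+ \sum_{t=1}^n \left( \Phi_{t-1}(\eta_t) - \Phi_t(\eta_t) \right)
- \sum_{t=1}^n \mathbb{E}_{I_t \sim p_t} \tilde{\ell}_{k,t}.$$

The first term is easy to bound in expectation since, by the rule of
conditional expectations and the last equality in (3.5) we have

$$\mathbb{E} \sum_{t=1}^n \frac{\eta_t}{2 p_{I_t,t}}
= \mathbb{E} \sum_{t=1}^n \mathbb{E}_{I_t \sim p_t} \frac{\eta_t}{2 p_{I_t,t}}
= \frac{K}{2} \sum_{t=1}^n \eta_t.$$

For the second term we start with an Abel transformation,

$$\sum_{t=1}^n \left( \Phi_{t-1}(\eta_t) - \Phi_t(\eta_t) \right)
= \sum_{t=1}^{n-1} \left( \Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \right) - \Phi_n(\eta_n)$$

since $\Phi_0(\eta_1) = 0$. Note that

$$\begin{aligned}
-\Phi_n(\eta_n)
&= \frac{\ln K}{\eta_n} - \frac{1}{\eta_n} \ln\left( \sum_{i=1}^K \exp\left(-\eta_n \tilde{L}_{i,n}\right) \right) \\
&\le \frac{\ln K}{\eta_n} - \frac{1}{\eta_n} \ln\left( \exp\left(-\eta_n \tilde{L}_{k,n}\right) \right) \\
&= \frac{\ln K}{\eta_n} + \sum_{t=1}^n \tilde{\ell}_{k,t}
\end{aligned}$$

and thus we have

$$\mathbb{E}\left[ \sum_{t=1}^n \ell_{I_t,t} - \sum_{t=1}^n \ell_{k,t} \right]
\le \frac{K}{2} \sum_{t=1}^n \eta_t + \frac{\ln K}{\eta_n}
+ \mathbb{E} \sum_{t=1}^{n-1} \left( \Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \right).$$

To conclude the proof, we show that $\Phi_t'(\eta) \ge 0$. Since
$\eta_{t+1} \le \eta_t$, we then obtain
$\Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \le 0$. Let

$$p_{i,t}^{\eta} = \frac{\exp\left(-\eta \tilde{L}_{i,t}\right)}{\sum_{k=1}^K \exp\left(-\eta \tilde{L}_{k,t}\right)}.$$

Then

$$\begin{aligned}
\Phi_t'(\eta)
&= -\frac{1}{\eta^2} \ln\left( \frac{1}{K} \sum_{i=1}^K \exp\left(-\eta \tilde{L}_{i,t}\right) \right)
- \frac{1}{\eta} \frac{\sum_{i=1}^K \tilde{L}_{i,t} \exp\left(-\eta \tilde{L}_{i,t}\right)}{\sum_{i=1}^K \exp\left(-\eta \tilde{L}_{i,t}\right)} \\
&= \frac{1}{\eta^2} \frac{1}{\sum_{i=1}^K \exp\left(-\eta \tilde{L}_{i,t}\right)}
\sum_{i=1}^K \exp\left(-\eta \tilde{L}_{i,t}\right)
\times \left( -\eta \tilde{L}_{i,t} - \ln\left( \frac{1}{K} \sum_{j=1}^K \exp\left(-\eta \tilde{L}_{j,t}\right) \right) \right).
\end{aligned}$$

Simplifying, we get (since $p_1$ is the uniform distribution over
$\{1, \dots, K\}$),

$$\Phi_t'(\eta) = \frac{1}{\eta^2} \sum_{i=1}^K p_{i,t}^{\eta} \ln\left( K p_{i,t}^{\eta} \right)
= \frac{1}{\eta^2} \mathrm{KL}\left( p_t^{\eta}, p_1 \right) \ge 0.$$

□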



3.2 High probability and expected regret bounds 

In this section we prove a high probability bound on the regret.
Unfortunately, the Exp3 strategy defined in the previous section is not
adequate for this task. Indeed, the variance of the estimate
$\tilde{\ell}_{i,t}$ is of order $1/p_{i,t}$, which can be arbitrarily
large. In order to ensure that the probabilities $p_{i,t}$ are bounded
from below, the original version of Exp3 mixes the exponential weights
with a uniform distribution over the arms. In order to avoid increasing
the regret, the mixing coefficient $\gamma$ associated with the uniform
distribution cannot be larger than $n^{-1/2}$. Since this implies that
the variance of the cumulative loss estimate $\tilde{L}_{i,n}$ can be of
order $n^{3/2}$, very little can be said about the concentration of
the regret also for this variant of Exp3.

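This issue can be solved by combining the mixing idea with a different
estimate for losses. In fact, the core idea is more transparent when
expressed in terms of gains, and so we turn to the gain version of the
problem. The trick is to introduce a bias in the gain estimate which
allows one to derive a high probability statement on this estimate.

Lemma 3.1. For $\beta \le 1$, let

$$\tilde{g}_{i,t} = \frac{g_{i,t} \mathbb{1}_{I_t = i} + \beta}{p_{i,t}}.$$

Then, with probability at least $1 - \delta$,

$$\sum_{t=1}^n g_{i,t} \le \sum_{t=1}^n \tilde{g}_{i,t} + \frac{\ln(\delta^{-1})}{\beta}.$$

Proof. Let $\mathbb{E}_t$ be the expectation conditioned on
$I_1, \dots, I_{t-1}$. Since $\exp(x) \le 1 + x + x^2$ for $x \le 1$, for
$\beta \le 1$ we have

$$\begin{aligned}
\mathbb{E}_t \exp\left( \beta g_{i,t} - \beta \frac{g_{i,t} \mathbb{1}_{I_t = i} + \beta}{p_{i,t}} \right)
&\le \left( 1 + \mathbb{E}_t\left[ \beta g_{i,t} - \beta \frac{g_{i,t} \mathbb{1}_{I_t = i}}{p_{i,t}} \right]
+ \mathbb{E}_t\left[ \left( \beta g_{i,t} - \beta \frac{g_{i,t} \mathbb{1}_{I_t = i}}{p_{i,t}} \right)^2 \right] \right)
\exp\left( -\frac{\beta^2}{p_{i,t}} \right) \\
&\le \left( 1 + \beta^2 \frac{g_{i,t}^2}{p_{i,t}} \right) \exp\left( -\frac{\beta^2}{p_{i,t}} \right) \\
&\le 1,
\end{aligned}$$

where the last inequality uses $1 + u \le \exp(u)$. As a consequence, we
have

$$\mathbb{E} \exp\left( \beta \sum_{t=1}^n g_{i,t} - \beta \sum_{t=1}^n \frac{g_{i,t} \mathbb{1}_{I_t = i} + \beta}{p_{i,t}} \right) \le 1.$$

Moreover, Markov's inequality implies
$\mathbb{P}\left( X > \ln(\delta^{-1}) \right) \le \delta\, \mathbb{E} e^{X}$
and thus, with probability at least $1 - \delta$,

$$\beta \sum_{t=1}^n g_{i,t} - \beta \sum_{t=1}^n \frac{g_{i,t} \mathbb{1}_{I_t = i} + \beta}{p_{i,t}} \le \ln(\delta^{-1}).$$

□

Exp3.P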

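Parameters: $\eta \in \mathbb{R}^+$ and $\gamma, \beta \in [0, 1]$.

Let $p_1$ be the uniform distribution over $\{1, \dots, K\}$.

For each round $t = 1, 2, \dots, n$

(1) Draw an arm $I_t$ from the probability distribution $p_t$.

(2) Compute the estimated gain for each arm:

$$\tilde{g}_{i,t} = \frac{g_{i,t} \mathbb{1}_{I_t = i} + \beta}{p_{i,t}}$$

and update the estimated cumulative gain:
$\tilde{G}_{i,t} = \sum_{s=1}^t \tilde{g}_{i,s}$.

(3) Compute the new probability distribution over the arms
$p_{t+1} = (p_{1,t+1}, \dots, p_{K,t+1})$ where:

$$p_{i,t+1} = (1 - \gamma) \frac{\exp\left( \eta \tilde{G}_{i,t} \right)}{\sum_{k=1}^K \exp\left( \eta \tilde{G}_{k,t} \right)} + \frac{\gamma}{K}.$$

Fig. 3.1 Exp3.P forecaster.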
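
The forecaster of Figure 3.1 can be sketched as follows. The three parameters are passed explicitly rather than tuned inside the function; the callback `gain_fn(t, arm)` (a hypothetical interface) returns the gain in $[0,1]$ of the played arm only, arms are indexed from 0, and subtracting the maximum estimated gain before exponentiating is a numerical-stability device that leaves the distribution unchanged.

```python
import math
import random

def exp3p(gain_fn, K, n, eta, gamma, beta, seed=0):
    """Exp3.P sketch; returns (cumulative gain, final distribution)."""
    rng = random.Random(seed)
    cum_est = [0.0] * K        # estimated cumulative gains
    probs = [1.0 / K] * K      # p_1 is the uniform distribution
    total_gain = 0.0
    for t in range(1, n + 1):
        arm = rng.choices(range(K), weights=probs)[0]
        gain = gain_fn(t, arm)
        total_gain += gain
        for i in range(K):
            observed = gain if i == arm else 0.0
            # Biased estimate: every arm receives the bonus beta / p_i.
            cum_est[i] += (observed + beta) / probs[i]
        shift = max(cum_est)
        weights = [math.exp(eta * (G - shift)) for G in cum_est]
        norm = sum(weights)
        # Mix the exponential weights with the uniform distribution.
        probs = [(1 - gamma) * w / norm + gamma / K for w in weights]
    return total_gain, probs
```

Unlike in Exp3, the estimate of every arm changes in every round because of the bias term, and the mixing guarantees that each probability stays at least $\gamma/K$.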



The strategy associated with these new estimates, called Exp3.P, is
described in Figure 3.1. Note that, for the sake of simplicity, the
strategy is described in the setting with known time horizon ($\eta$ is
constant). Anytime results can easily be derived with the same
techniques as in the proof of Theorem 3.1.

In the next theorem we propose two different high probability
bounds. In (3.10) the algorithm needs the confidence level $\delta$ as an
input parameter. In (3.11) the algorithm satisfies a high probability
bound for any confidence level. This latter property is particularly
important to derive good bounds on the expected regret.



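Theorem 3.2 (High probability bound for Exp3.P). For any
given $\delta \in (0, 1)$, if Exp3.P is run with

$$\beta = \sqrt{\frac{\ln(K\delta^{-1})}{nK}}, \quad
\eta = 0.95 \sqrt{\frac{\ln K}{nK}}, \quad
\gamma = 1.05 \sqrt{\frac{K \ln K}{n}}$$

then, with probability at least $1 - \delta$,

$$R_n \le 5.15 \sqrt{n K \ln(K\delta^{-1})}. \qquad (3.10)$$

Moreover, if Exp3.P is run with $\beta = \sqrt{\frac{\ln K}{nK}}$ while
$\eta$ and $\gamma$ are chosen as before, then, with probability at least
$1 - \delta$,

$$R_n \le \sqrt{\frac{nK}{\ln K}} \ln(\delta^{-1}) + 5.15 \sqrt{n K \ln K}. \qquad (3.11)$$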



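Proof. We first prove (in three steps) that if $\gamma \le 1/2$ and
$(1 + \beta) K \eta \le \gamma$, then Exp3.P satisfies, with probability at
least $1 - \delta$,

$$R_n \le \beta n K + \gamma n + (1 + \beta) \eta K n + \frac{\ln(K\delta^{-1})}{\beta} + \frac{\ln K}{\eta}. \qquad (3.12)$$

First step: Notation and simple equalities.

One can immediately see that
$\mathbb{E}_{i \sim p_t} \tilde{g}_{i,t} = g_{I_t,t} + \beta K$, and thus

$$\sum_{t=1}^n g_{k,t} - \sum_{t=1}^n g_{I_t,t}
= \beta n K + \sum_{t=1}^n g_{k,t} - \sum_{t=1}^n \mathbb{E}_{i \sim p_t} \tilde{g}_{i,t}. \qquad (3.13)$$

The key step is, again, to consider the cumulant-generating function of
$\tilde{g}_{i,t}$. However, because of the mixing, we need to introduce a
few more notations. Let $u = \left( \frac{1}{K}, \dots, \frac{1}{K} \right)$
be the uniform distribution over the arms, and let
$w_t = \frac{p_t - \gamma u}{1 - \gamma}$ be the distribution induced by
Exp3.P at time $t$ without the mixing. Then we have:

$$\begin{aligned}
-\mathbb{E}_{i \sim p_t} \tilde{g}_{i,t}
&= -(1 - \gamma) \mathbb{E}_{i \sim w_t} \tilde{g}_{i,t} - \gamma \mathbb{E}_{i \sim u} \tilde{g}_{i,t} \\
&= (1 - \gamma) \left\{ \frac{1}{\eta} \ln \mathbb{E}_{i \sim w_t} \exp\left( \eta \left( \tilde{g}_{i,t} - \mathbb{E}_{k \sim w_t} \tilde{g}_{k,t} \right) \right)
- \frac{1}{\eta} \ln \mathbb{E}_{i \sim w_t} \exp\left( \eta \tilde{g}_{i,t} \right) \right\}
- \gamma \mathbb{E}_{i \sim u} \tilde{g}_{i,t}. \qquad (3.14)
\end{aligned}$$

Second step: Study of the first term in (3.14).

We use the inequalities $\ln x \le x - 1$ and
$\exp(x) \le 1 + x + x^2$, for all $x \le 1$, as well as the fact that
$\eta \tilde{g}_{i,t} \le 1$ since $(1 + \beta) \eta K \le \gamma$:

$$\begin{aligned}
\ln \mathbb{E}_{i \sim w_t} \exp\left( \eta \left( \tilde{g}_{i,t} - \mathbb{E}_{k \sim w_t} \tilde{g}_{k,t} \right) \right)
&= \ln \mathbb{E}_{i \sim w_t} \exp\left( \eta \tilde{g}_{i,t} \right) - \eta \mathbb{E}_{k \sim w_t} \tilde{g}_{k,t} \\
&\le \mathbb{E}_{i \sim w_t} \left\{ \exp\left( \eta \tilde{g}_{i,t} \right) - 1 - \eta \tilde{g}_{i,t} \right\} \\
&\le \mathbb{E}_{i \sim w_t} \eta^2 \tilde{g}_{i,t}^2 \\
&\le \frac{(1 + \beta) \eta^2}{1 - \gamma} \sum_{i=1}^K \tilde{g}_{i,t}, \qquad (3.15)
\end{aligned}$$

where we used $\frac{w_{i,t}}{p_{i,t}} \le \frac{1}{1 - \gamma}$ and
$p_{i,t} \tilde{g}_{i,t} \le 1 + \beta$ in the last step.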

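Third step: Summing.

Set $\tilde{G}_{i,0} = 0$. Recall that $w_t = (w_{1,t}, \dots, w_{K,t})$
with

$$w_{i,t} = \frac{\exp\left( \eta \tilde{G}_{i,t-1} \right)}{\sum_{k=1}^K \exp\left( \eta \tilde{G}_{k,t-1} \right)}. \qquad (3.16)$$

Then substituting (3.15) in (3.14) and summing using (3.16), we obtain

$$\begin{aligned}
-\sum_{t=1}^n \mathbb{E}_{i \sim p_t} \tilde{g}_{i,t}
&\le (1 + \beta) \eta \sum_{t=1}^n \sum_{i=1}^K \tilde{g}_{i,t}
- \frac{1 - \gamma}{\eta} \sum_{t=1}^n \ln\left( \sum_{i=1}^K w_{i,t} \exp\left( \eta \tilde{g}_{i,t} \right) \right) \\
&= (1 + \beta) \eta \sum_{t=1}^n \sum_{i=1}^K \tilde{g}_{i,t}
- \frac{1 - \gamma}{\eta} \ln\left( \prod_{t=1}^n \frac{\sum_{i=1}^K \exp\left( \eta \tilde{G}_{i,t} \right)}{\sum_{i=1}^K \exp\left( \eta \tilde{G}_{i,t-1} \right)} \right) \\
&\le (1 + \beta) \eta K \max_j \tilde{G}_{j,n} + \frac{\ln K}{\eta}
- \frac{1 - \gamma}{\eta} \ln\left( \sum_{i=1}^K \exp\left( \eta \tilde{G}_{i,n} \right) \right) \\
&\le -\left( 1 - \gamma - (1 + \beta) \eta K \right) \max_j \tilde{G}_{j,n} + \frac{\ln K}{\eta} \\
&\le -\left( 1 - \gamma - (1 + \beta) \eta K \right) \max_j \sum_{t=1}^n g_{j,t}
+ \frac{\ln(K\delta^{-1})}{\beta} + \frac{\ln K}{\eta}.
\end{aligned}$$

The last inequality comes from Lemma 3.1, the union bound, and
$0 \le 1 - \gamma - (1 + \beta) \eta K \le 1$, which is a consequence of
$(1 + \beta) \eta K \le \gamma \le 1/2$. Combining this last inequality with
(3.13) we obtain

$$R_n \le \beta n K + \gamma n + (1 + \beta) \eta K n + \frac{\ln(K\delta^{-1})}{\beta} + \frac{\ln K}{\eta},$$

which is the desired result.

Inequality (3.10) is then proved as follows. First, it is trivial if
$n \le 5.15 \sqrt{n K \ln(K\delta^{-1})}$ (since $R_n \le n$ always holds),
and thus we can assume that this is not the case. This implies that
$\gamma \le 0.21$ and $\beta \le 0.1$, and thus we have
$(1 + \beta) \eta K \le \gamma \le 1/2$. Using (3.12) directly yields the
claimed bound. The same argument can be used to derive (3.11). □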



We now discuss expected regret bounds. As the cautious reader may
already have observed, if the adversary is oblivious, namely when
$(\ell_{1,t}, \dots, \ell_{K,t})$ is independent of $I_1, \dots, I_{t-1}$
for each $t$, a pseudo-regret bound implies the same bound on the
expected regret. This follows from noting that the expected regret
against an oblivious adversary is smaller than the maximal
pseudo-regret against deterministic adversaries, see [Audibert and
Bubeck, 2010, Proposition 33] for a proof of this fact. In the general
case of a non-oblivious adversary, the loss vector
$(\ell_{1,t}, \dots, \ell_{K,t})$ at time $t$ depends on the past actions
of the forecaster. This makes the analysis of the expected regret more
intricate. One way around this difficulty is to first prove high
probability bounds, and then integrate the resulting bound. Following
this method, we derive a bound on the expected regret of Exp3.P using
(3.11).

Theorem 3.3 (Expected regret of Exp3.P). If Exp3.P is run
with

$$\beta = \sqrt{\frac{\ln K}{nK}}, \quad
\eta = 0.95 \sqrt{\frac{\ln K}{nK}}, \quad
\gamma = 1.05 \sqrt{\frac{K \ln K}{n}}$$

then

$$\mathbb{E} R_n \le 5.15 \sqrt{n K \ln K} + \sqrt{\frac{nK}{\ln K}}. \qquad (3.17)$$

Proof. We integrate the deviations in (3.11) using the formula

$$\mathbb{E} W \le \int_0^1 \frac{1}{\delta}\, \mathbb{P}\left( W > \ln \frac{1}{\delta} \right) d\delta$$

for a real-valued random variable $W$. In particular, taking

$$W = \sqrt{\frac{\ln K}{nK}} \left( R_n - 5.15 \sqrt{n K \ln K} \right)$$

yields $\mathbb{E} W \le 1$, which is equivalent to (3.17). □



3.3 Lower Bound 

The next theorem shows that the results of the previous sections are
essentially unimprovable, up to logarithmic factors. The result is proven
via the probabilistic method: we show that there exists a distribution of
rewards for the arms such that the pseudo-regret of any forecaster must
be high when averaged over this distribution. Owing to this
probabilistic construction, the lower bound proof is based on the same
Kullback-Leibler divergence as the one used in the proof of the lower
bound for stochastic bandits, see Subsection 2.3. We are not aware of
other techniques for proving bandit lower bounds.

We find it more convenient to prove the results for rewards rather
than losses. In order to emphasize that our rewards are stochastic (in
particular, Bernoulli random variables), we use $Y_{i,t} \in \{0, 1\}$ to
denote the reward obtained by pulling arm $i$ at time $t$.



Theorem 3.4 (Minimax lower bound). Let sup be the supremum
over all distributions of rewards such that, for $i = 1, \dots, K$, the
rewards $Y_{i,1}, Y_{i,2}, \dots \in \{0, 1\}$ are i.i.d., and let inf be
the infimum over all forecasters. Then

$$\inf \sup \left( \max_{i=1,\dots,K} \mathbb{E} \sum_{t=1}^n Y_{i,t} - \mathbb{E} \sum_{t=1}^n Y_{I_t,t} \right) \ge \frac{1}{20} \sqrt{nK} \qquad (3.18)$$

where expectations are with respect to both the random generation of
rewards and the internal randomization of the forecaster.

Since $\max_{i=1,\dots,K} \mathbb{E} \sum_{t=1}^n Y_{i,t} - \mathbb{E} \sum_{t=1}^n Y_{I_t,t} = \overline{R}_n \le \mathbb{E} R_n$,
Theorem 3.4 immediately entails a lower bound on the regret of any
forecaster.

The general idea of the proof goes as follows. Since at least one arm
is pulled less than $n/K$ times, for this arm one cannot differentiate
between a Bernoulli of parameter $1/2$ and a Bernoulli of parameter
$1/2 + \sqrt{K/n}$. Thus, if all arms are Bernoulli of parameter $1/2$
but one, whose parameter is $1/2 + \sqrt{K/n}$, then the forecaster
should incur a regret of order $n \sqrt{K/n} = \sqrt{nK}$. In order to
formalize this idea, we use the Kullback-Leibler divergence, and in
particular Pinsker's inequality, to compare the behavior of a given
forecaster against: (1) the distribution where all arms are Bernoulli of
parameter $1/2$; (2) the same distribution where the parameter of one
arm is increased by $\varepsilon$.

We start by proving a more general lemma, which could also be
used to derive lower bounds in other contexts. The proof of Theorem
3.4 then follows by a simple optimization over $\varepsilon$.

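Lemma 3.2. Let $\varepsilon \in [0, 1)$. For any $i \in \{1, \dots, K\}$
let $\mathbb{E}_i$ be the expectation against the joint distribution of
rewards where all arms are i.i.d. Bernoulli of parameter
$\frac{1 - \varepsilon}{2}$ but arm $i$, which is i.i.d. Bernoulli of
parameter $\frac{1 + \varepsilon}{2}$. Then, for any forecaster,

$$\max_{i=1,\dots,K} \mathbb{E}_i \sum_{t=1}^n \left( Y_{i,t} - Y_{I_t,t} \right)
\ge n \varepsilon \left( 1 - \frac{1}{K} - \sqrt{ \varepsilon \ln \frac{1 + \varepsilon}{1 - \varepsilon} \cdot \frac{n}{2K} } \right).$$

Proof. We provide a proof in five steps by lower bounding
$\frac{1}{K} \sum_{i=1}^K \mathbb{E}_i \sum_{t=1}^n \left( Y_{i,t} - Y_{I_t,t} \right)$.
This implies the statement of the lemma because a max is larger than a
mean.

First step: Empirical distribution of plays.

We start by considering a deterministic forecaster. Let
$q_n = (q_{1,n}, \dots, q_{K,n})$ be the empirical distribution of plays
over the arms defined by $q_{i,n} = \frac{T_i(n)}{n}$; recall from
Chapter 2 that $T_i(n)$ denotes the number of times arm $i$ was selected
in the first $n$ rounds. Let $J_n$ be drawn according to $q_n$. We denote
by $\mathbb{P}_i$ the law of $J_n$ against the distribution where all arms
are i.i.d. Bernoulli of parameter $\frac{1 - \varepsilon}{2}$ but arm $i$,
which is i.i.d. Bernoulli of parameter $\frac{1 + \varepsilon}{2}$ (we
call this the $i$-th stochastic adversary). Recall that
$\mathbb{P}_i(J_n = j) = \mathbb{E}_i \frac{T_j(n)}{n}$, hence

$$\mathbb{E}_i \sum_{t=1}^n \left( Y_{i,t} - Y_{I_t,t} \right)
= \varepsilon n \sum_{j \neq i} \mathbb{P}_i(J_n = j)
= \varepsilon n \left( 1 - \mathbb{P}_i(J_n = i) \right)$$

which implies

$$\frac{1}{K} \sum_{i=1}^K \mathbb{E}_i \sum_{t=1}^n \left( Y_{i,t} - Y_{I_t,t} \right)
= \varepsilon n \left( 1 - \frac{1}{K} \sum_{i=1}^K \mathbb{P}_i(J_n = i) \right). \qquad (3.19)$$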



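Second step: Pinsker's inequality.

Let $\mathbb{P}_0$ be the law of $J_n$ for the distribution where all
arms are i.i.d. Bernoulli of parameter $\frac{1 - \varepsilon}{2}$. Then
Pinsker's inequality immediately gives
$\mathbb{P}_i(J_n = i) \le \mathbb{P}_0(J_n = i) + \sqrt{\frac{1}{2} \mathrm{KL}(\mathbb{P}_0, \mathbb{P}_i)}$,
and so

$$\frac{1}{K} \sum_{i=1}^K \mathbb{P}_i(J_n = i)
\le \frac{1}{K} + \frac{1}{K} \sum_{i=1}^K \sqrt{\frac{1}{2} \mathrm{KL}(\mathbb{P}_0, \mathbb{P}_i)}. \qquad (3.20)$$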



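Third step: Computation of $\mathrm{KL}(\mathbb{P}_0, \mathbb{P}_i)$.

Since the forecaster is deterministic, the sequence of rewards
$Y^n = (Y_1, \dots, Y_n) \in \{0, 1\}^n$ received by the forecaster
uniquely determines the empirical distribution of plays $q_n$. In
particular, the law of $J_n$ conditionally to $Y^n$ is the same for any
$i$-th stochastic adversary. For each $i = 0, \dots, K$, let
$\mathbb{P}_i^n$ be the law of $Y^n$ against the $i$-th adversary. Then
one can easily show that
$\mathrm{KL}(\mathbb{P}_0, \mathbb{P}_i) \le \mathrm{KL}(\mathbb{P}_0^n, \mathbb{P}_i^n)$.
Now we use the chain rule for Kullback-Leibler divergence, see for
example [Cesa-Bianchi and Lugosi, 2006, Section A.2], iteratively to
introduce the laws $\mathbb{P}_i^t$ of $Y^t = (Y_1, \dots, Y_t)$. More
precisely, we have

$$\begin{aligned}
\mathrm{KL}(\mathbb{P}_0^n, \mathbb{P}_i^n)
&= \mathrm{KL}(\mathbb{P}_0^1, \mathbb{P}_i^1)
+ \sum_{t=2}^n \sum_{y^{t-1}} \mathbb{P}_0^{t-1}(y^{t-1})\,
\mathrm{KL}\left( \mathbb{P}_0^t(\cdot \mid y^{t-1}), \mathbb{P}_i^t(\cdot \mid y^{t-1}) \right) \\
&= \mathrm{KL}(\mathbb{P}_0^1, \mathbb{P}_i^1)
+ \sum_{t=2}^n \left( \sum_{y^{t-1} : I_t = i} \mathbb{P}_0^{t-1}(y^{t-1})\,
\mathrm{KL}\left( \tfrac{1 - \varepsilon}{2}, \tfrac{1 + \varepsilon}{2} \right)
+ \sum_{y^{t-1} : I_t \neq i} \mathbb{P}_0^{t-1}(y^{t-1})\,
\mathrm{KL}\left( \tfrac{1 - \varepsilon}{2}, \tfrac{1 - \varepsilon}{2} \right) \right) \\
&= \mathrm{KL}\left( \tfrac{1 - \varepsilon}{2}, \tfrac{1 + \varepsilon}{2} \right) \mathbb{E}_0 T_i(n), \qquad (3.21)
\end{aligned}$$

where the second inner sum vanishes because the two conditional laws
coincide whenever the arm played is not arm $i$.

Fourth step: conclusion for deterministic forecasters.

By using that the square root is concave, and combining
$\mathrm{KL}(\mathbb{P}_0, \mathbb{P}_i) \le \mathrm{KL}(\mathbb{P}_0^n, \mathbb{P}_i^n)$
with (3.21), we deduce that

$$\begin{aligned}
\frac{1}{K} \sum_{i=1}^K \sqrt{\mathrm{KL}(\mathbb{P}_0, \mathbb{P}_i)}
&\le \sqrt{ \frac{1}{K} \sum_{i=1}^K \mathrm{KL}(\mathbb{P}_0, \mathbb{P}_i) } \\
&\le \sqrt{ \frac{1}{K} \sum_{i=1}^K \mathrm{KL}\left( \tfrac{1 - \varepsilon}{2}, \tfrac{1 + \varepsilon}{2} \right) \mathbb{E}_0 T_i(n) } \\
&= \sqrt{ \frac{n}{K}\, \mathrm{KL}\left( \tfrac{1 - \varepsilon}{2}, \tfrac{1 + \varepsilon}{2} \right) }. \qquad (3.22)
\end{aligned}$$

We conclude the proof for deterministic forecasters by applying (3.20)
and (3.22) to (3.19), and observing that
$\mathrm{KL}\left( \tfrac{1 - \varepsilon}{2}, \tfrac{1 + \varepsilon}{2} \right) = \varepsilon \ln \frac{1 + \varepsilon}{1 - \varepsilon}$.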
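
The closed form of the divergence used in the last step, and the resulting right-hand side of Lemma 3.2, can be checked numerically. The function names below are hypothetical, and the particular choice $\varepsilon = \frac{1}{4}\sqrt{K/n}$ in the usage note is only an illustrative value, not the optimized one.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lemma_3_2_bound(n, K, eps):
    """Right-hand side of Lemma 3.2:
    n * eps * (1 - 1/K - sqrt(eps * ln((1 + eps) / (1 - eps)) * n / (2K)))."""
    kl = eps * math.log((1 + eps) / (1 - eps))
    return n * eps * (1 - 1 / K - math.sqrt(kl * n / (2 * K)))
```

For instance, with $n = 10000$, $K = 10$ and $\varepsilon = \frac{1}{4}\sqrt{K/n}$, `lemma_3_2_bound` returns a value that is a constant fraction of $\sqrt{nK}$, which is the scaling claimed in Theorem 3.4.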

Fifth step: randomized forecasters via Fubini's Theorem.

Extending previous results to randomized forecasters is easy. Denote
by $\mathbb{E}_r$ the expectation with respect to the forecaster's
internal randomization. Then Fubini's Theorem implies

$$\frac{1}{K} \sum_{i=1}^K \mathbb{E}_i \sum_{t=1}^n \mathbb{E}_r \left( Y_{i,t} - Y_{I_t,t} \right)
= \mathbb{E}_r \frac{1}{K} \sum_{i=1}^K \mathbb{E}_i \sum_{t=1}^n \left( Y_{i,t} - Y_{I_t,t} \right).$$

Now the proof is concluded by applying the lower bound on
$\frac{1}{K} \sum_{i=1}^K \mathbb{E}_i \sum_{t=1}^n \left( Y_{i,t} - Y_{I_t,t} \right)$,
which we proved in previous steps, to each realization of the
forecaster's random bits. □

3.4 Refinements and bibliographic remarks 

The adversarial framework studied in this chapter was originally
investigated in a full information setting, where at the end of each
round the forecaster observes the complete loss vector
$(\ell_{1,t}, \dots, \ell_{K,t})$. We refer the reader to Cesa-Bianchi and
Lugosi [2006] for the history of this problem. The Exp3 and Exp3.P
strategies were introduced¹ and analyzed by Auer et al. [2002b], where
the lower bound of Theorem 3.4 is also proven. The proofs presented in
this chapter are taken from Bubeck [2010]. We now give an overview of
some of the many improvements and refinements that have been proposed
since these initial analyses.



3.4.1 Log-free upper bounds 

One can see that there is a logarithmic gap between the pseudo-regret
of Exp3, presented in Theorem 3.1, and the minimax lower bound of
Theorem 3.4. This gap was closed by Audibert and Bubeck [2009], who
constructed a new class of strategies and showed that some of them
have a pseudo-regret of order $\sqrt{nK}$. This new class of strategies,
called INF (Implicitly Normalized Forecaster), is based on the following
idea. First, note that one can generalize the exponential weighting
scheme of Exp3 as follows: given a potential function $\psi$, assign the
probability

$$p_{i,t+1} = \frac{\psi\left( -\tilde{L}_{i,t} \right)}{\sum_{j=1}^K \psi\left( -\tilde{L}_{j,t} \right)}.$$

This type of strategy is called a weighted average forecaster, see
[Cesa-Bianchi and Lugosi, 2006, Chapter 2]. In INF the normalization
is done implicitly, by a translation of the losses. More precisely, INF
with potential $\psi$ assigns the probability
$p_{i,t+1} = \psi\left( C_t - \tilde{L}_{i,t} \right)$, where $C_t$ is the
constant such that $p_{t+1}$ sums to 1. The key to obtain a minimax
optimal pseudo-regret is to take $\psi$ of the form
$\psi(x) = (-\eta x)^{-q}$ with $q > 1$, while Exp3 corresponds to
$\psi(x) = \exp(\eta x)$. Audibert et al. [2011] realized that the INF
strategy can be formulated as a Mirror Descent algorithm. This point of
view significantly simplifies the proofs. We refer the reader to
Chapter 5 (and in particular Theorem 5.7) for more details.

¹ In its original formulation the Exp3 strategy was defined as a mixture
of exponential weights with the uniform distribution on the set of arms.
It was noted in Stoltz [2005] that this mixing is not necessary, see
footnote 2 on p. 26 in Bubeck [2010] for more details on this.

While it is possible to get log-free pseudo-regret bounds, the
situation becomes significantly more complicated when one considers
high probability regret and expected regret. Audibert and Bubeck [2010]
proved that one can get a log-free expected regret if the adversary
is oblivious, i.e., the sequence of loss vectors is independent of the
forecaster's actions. Moreover, it is also possible to get a log-free
high probability regret if the adversary is fully oblivious (i.e., the
loss vectors are independently drawn, but not identically distributed;
this includes the oblivious adversary). It is conjectured (in Audibert
and Bubeck [2010]) that it is not possible to obtain a log-free expected
regret bound against a general non-oblivious adversary.
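
The implicit normalization used by INF has no closed form for the polynomial potential, but the constant $C_t$ can be found numerically. The sketch below assumes $\psi(x) = (-\eta x)^{-q}$ and locates $C_t$ by bisection; the function name, the bracketing interval and the iteration count are illustrative choices rather than part of the original algorithm description.

```python
def inf_probabilities(cum_losses, eta, q=2.0, iters=100):
    """Return p_i = psi(C - L_i) with psi(x) = (-eta * x) ** (-q),
    where C < min_i L_i is chosen by bisection so that the p_i sum to 1."""
    low_loss = min(cum_losses)
    K = len(cum_losses)

    def total(C):
        return sum((eta * (L - C)) ** (-q) for L in cum_losses)

    # total(C) is increasing in C on (-inf, min L). At `hi` the best arm alone
    # already has mass 1, so total(hi) >= 1; at `lo` every arm has mass <= 1/K.
    hi = low_loss - 1.0 / eta
    lo = low_loss - K ** (1.0 / q) / eta
    for _ in range(iters):
        mid = (lo + hi) / 2
        if total(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    C = (lo + hi) / 2
    return [(eta * (L - C)) ** (-q) for L in cum_losses]
```

Compared with exponential weights, the polynomial potential puts heavier mass on arms with large estimated loss, so they keep being explored at a polynomial rather than exponential rate.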



3.4.2 Adaptive bounds 

One of the strengths of the bounds proposed in this chapter is also one
of their weaknesses: the bounds hold against any adversary. It is clear
that in some cases it is possible to obtain a much smaller regret than
the worst case regret. For example, when the sequence of losses is an
i.i.d. sequence, we proved in Chapter 2 that it is possible to obtain
a logarithmic pseudo-regret (provided that the gap $\Delta$ is considered
as a constant). Thus it is natural to ask if it is possible to have
strategies with minimax optimal regret, but also with much smaller
regret when the loss sequence is not worst case.



The first bound in this direction was proved by Auer et al. [2002b],
who showed that, for the gain version of the problem and against
an oblivious adversary, Exp3 has a pseudo-regret of order
$\sqrt{K G_n^*}$ (omitting log factors), where $G_n^* \le n$ is the
maximal cumulative reward of the optimal arm after $n$ rounds. This
result was improved by Audibert and Bubeck [2010], who showed that
using the gain estimate

$$\tilde{g}_{i,t} = -\frac{\mathbb{1}_{I_t = i}}{\beta} \ln\left( 1 - \frac{\beta g_{i,t}}{p_{i,t}} \right)$$

one can bound the regret with high probability by essentially the same
quantity as before, and against any adversary.



Another direction was explored by Hazan and Kale [2009], building on
previous works in the full information setting, see Cesa-Bianchi et al.
[2007]. In this work the authors proved that one can attain a regret of
order $\sqrt{\sum_{i=1}^K V_{i,n}}$ excluding log factors, where

$$V_{i,n} = \sum_{t=1}^n \left( \ell_{i,t} - \frac{1}{n} \sum_{s=1}^n \ell_{i,s} \right)^2$$

is the total variation of the loss for arm $i$. In fact their result is
more general, as it applies to the linear bandit framework, see
Chapter 5. The main new ingredient in their analysis is a "reservoir
sampling" procedure. We refer the reader to Hazan and Kale [2009] for
details. See also the works of Slivkins and Upfal [2008], Slivkins [2011]
for related results on slowly changing bandits.

In Section 3.4.4 below we describe another type of adaptive bound,
where one combines minimax optimal regret for the adversarial model
with logarithmic pseudo-regret for the stochastic model.

3.4.3 Competing with the best switching strategy 

While competing against the policy consistently playing the best fixed 
arm is a natural way of defining regret, in some applications it might be 
interesting to consider regret with respect to a bigger class of policies. 
Though this problem is the focus of Chapter 4, there is a class of
natural policies that can be directly dealt with by the methods of this
chapter. Namely, consider the problem of competing against any policy
constrained to make at most $S \le n$ switches (a switch is when the
arm played at time $t$ is different from the arm played at time $t + 1$).
This problem was studied by Auer [2002], where it was first shown
that a simple variant of Exp3 attains a low switching regret against
oblivious adversaries. Later, Audibert and Bubeck [2010] proved that
Exp3.P attains an expected regret (and a high probability regret) of
order $\sqrt{nKS \ln(nK/S)}$ for this problem.

3.4.4 Stochastic versus adversarial bandits 

From a practical viewpoint, Exp3 should be a safe choice when we have 
reasons to believe that the sequence of rewards is not well matched 
by any i.i.d. process. Indeed, it is easy to prove that UCB can have
linear regret, i.e. $R_n = \Omega(n)$, on certain deterministic sequences. In
Bubeck and Slivkins [2012] a new strategy was described, called SAO
(Stochastic and Adversarial Optimal), which enjoys (up to logarithmic
factors) both the guarantee of Exp3 for the adversarial model
and the guarantee of UCB for the stochastic model. More precisely,
SAO satisfies $R_n = O\left(\frac{K}{\Delta} \log^2(n) \log(K)\right)$ in the stochastic model and
$R_n = O\left(\sqrt{nK} \log^{3/2}(n) \log(K)\right)$ in the adversarial model. Note that
while this result is a step towards more flexible strategies, the very
notion of regret $R_n$ can become vacuous with nonstationarities in the
reward sequence, since the total reward of the best fixed action might
be very small. In that case the notion of switching regret (see
Subsection 3.4.3) is more relevant, and it would be interesting to derive a
strategy with logarithmic regret in the stochastic model, and a switching
regret of order $\sqrt{nKS}$ in the adversarial model.

3.4.5 Alternative feedback structures 

As mentioned at the beginning of this section, the adversarial multi- 
armed bandit is a variation of the full information setting, with a weaker 
feedback signal (only the incurred loss versus the full vector of losses is 
observed). Many other feedback structures can be considered, and we 
conclude the chapter by describing a few of them. 

In the label efficient setting, originally proposed by
Helmbold and Panizza [1997], at the end of each round the forecaster
has to decide whether to ask for the losses of the current round,
knowing that this can be done for at most $m \le n$ times. In this setting,
Cesa-Bianchi et al. [2005] proved that the minimax pseudo-regret is
of order $n\sqrt{\frac{\ln K}{m}}$. A bandit label efficient version was proposed by
Allenberg et al. [2006]. Audibert and Bubeck [2010] proved that the
minimax pseudo-regret for the bandit label efficient version is of order
$n\sqrt{\frac{K}{m}}$. These results do not require any fundamentally new algorithmic
idea, besides the fact that the forecaster has to randomize to select the
rounds in which the losses are revealed. Roughly speaking, a simple
coin toss with parameter $\varepsilon = m/n$ is sufficient to obtain an optimal
regret.



Mannor and Shamir [2011] study a model that interpolates between
the full information and the bandit setting. The basic idea is that there
is an undirected graph $G$ with $K$ vertices (one vertex for each arm)
that encodes the feedback structure. When one pulls arm $i$ the losses
of all neighboring arms $j \in N(i)$ in the graph are observed. Thus, a

graph with no edges is equivalent to the bandit problem, while the 
complete graph is equivalent to the full information setting. Given the 
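feedback structure $G$, it is natural to consider the following unbiased
loss estimate

$$\tilde\ell_{i,t} = \frac{\ell_{i,t}\,\mathbb{1}_{I_t \in N(i)}}{\sum_{j \in N(i)} p_{j,t}},$$

where $N(i)$ is the neighborhood of $i$ in $G$, taken to include $i$ itself.
Using Exp3 with this loss estimate, the authors show that the minimax
pseudo-regret (up to logarithmic factors) is of order $\sqrt{\alpha(G) n}$,
where $\alpha(G)$ is the independence number of graph $G$. Note that this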
interpolated setting naturally arises in applications like ad placement 
on websites. Indeed, if a user clicks on an advertisement, it is plausible 
to assume that the same user would have clicked on similar advertise- 
ments, had they been displayed. 
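The graph-feedback estimate can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the function name and argument layout are hypothetical, arms are indexed from 0, and each neighborhood is closed (arm $i$ is added to its own neighbor set) so that the empty graph reduces to the usual bandit estimate.

```python
def graph_loss_estimates(losses, probs, played, neighbors):
    """Importance-weighted loss estimates under graph feedback.

    losses[i]    : loss of arm i this round (used only when it is observed)
    probs[i]     : probability of playing arm i this round
    played       : index of the arm actually played
    neighbors[i] : set of arms adjacent to i in the graph (i itself excluded)
    """
    estimates = []
    for i in range(len(probs)):
        closed = set(neighbors[i]) | {i}  # closed neighborhood of arm i
        if played in closed:              # the loss of arm i was revealed
            estimates.append(losses[i] / sum(probs[j] for j in closed))
        else:
            estimates.append(0.0)
    return estimates
```

Averaging the estimate of arm $i$ over the random choice of the played arm returns its true loss, which is the unbiasedness property used in the analysis.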

The above problems are all specific examples of the more general 
partial monitoring setting. In this model, at the end of each round the 
player does not observe the incurred loss $\ell_{I_t,t}$ but rather a stochastic
"signal" $S_{I_t,t}$. A prototypical example of this scenario is the following: a
website is repeatedly selling the same item to a sequence of customers.
The selling price is dynamically adjusted, and each customer buys the
item only if the current price is smaller than or equal to their own hidden
value for the item. The pricing algorithm (i.e., the player in our ter- 
minology) does not see each user's value, but only whether the user 
bought the item or not. 

The relationship between the signals and the incurred losses defines
the instance of a partial monitoring problem. We refer the interested
reader to Cesa-Bianchi and Lugosi [2006] for more details, including a
historical account. Recent progress on this problem has been made by
Bartók et al. [2010] and Foster and Rakhlin [2012].



4 Contextual bandits



A natural extension of the multi-armed bandit problem is obtained by
associating side information with each arm. Based on this side infor- 
mation, or context, a notion of "contextual regret" is introduced where 
optimality is defined with respect to the best policy (i.e., mapping 
from contexts to arms) rather than the best arm. The space of policies, 
within which the optimum is sought, is typically chosen in order to 
have some desired structure. A different viewpoint is obtained when 
contexts are privately accessed by the policies (which are then appro- 
priately called "experts"). In this case the contextual information is 
hidden from the forecaster, and arms must be chosen based only on 
the past estimated performance of the experts. 

Contextual bandits naturally arise in many applications. For exam- 
ple, in personalized news article recommendation the task is to select, 
from a pool of candidates, a news article to display whenever a new 
user visits a website. The articles correspond to arms, and a reward 
is obtained whenever the user clicks on the selected article. Side infor- 
mation, in the form of features, can be extracted from both user and 
articles. For the user this may include historical activities, demographic 
information, and geolocation; for the articles, we may have content
information and categories. See Li et al. [2010] for more details on this
application of contextual bandits. 

In general, the presence of contexts creates a wide spectrum of possi- 
ble variations obtained by combining assumptions on the rewards with 
assumptions on the nature of contexts and policies. In this chapter we 
describe just a few of the results available in the literature, and use the 
bibliographic remarks to mention all those that we are aware of. 



4.1 Bandits with side information 

The most basic example of contextual bandits is obtained when game
rounds $t = 1,2,\dots$ are marked by contexts $s_1, s_2, \dots$ from a given
context set $\mathcal{S}$. The forecaster must learn the best mapping
$g : \mathcal{S} \to \{1,\dots,K\}$ of contexts to arms. We analyze this simple side
information setting in the case of adversarial rewards, and we further assume
that the sequence of contexts $s_t$ is arbitrary but fixed. The approach we
take is the simplest: run a separate instance of Exp3 on each distinct 
context. 

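We introduce the following notion of pseudoregret

$$\overline{R}_n^{\mathcal{S}} = \max_{g : \mathcal{S} \to \{1,\dots,K\}} \mathbb{E}\left[\sum_{t=1}^n \ell_{I_t,t} - \sum_{t=1}^n \ell_{g(s_t),t}\right].$$

Here $s_t \in \mathcal{S}$ denotes the context marking the $t$-th game round. A
bound on this pseudoregret is almost immediately obtained using the
adversarial bandit results from Section 3.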



Theorem 4.1. There exists a randomized forecaster for bandits with
side information (the $\mathcal{S}$-Exp3 forecaster, defined in the proof) that
satisfies

$$\overline{R}_n^{\mathcal{S}} \le \sqrt{2n|\mathcal{S}|K \ln K}$$

for any set $\mathcal{S}$ of contexts.



Proof. Let $S = |\mathcal{S}|$. The $\mathcal{S}$-Exp3 forecaster runs an instance of Exp3 on
each context $s \in \mathcal{S}$. Let $n_s$ be the number of times when $s_t = s$ within the
first $n$ time steps. Using the bound (3.2) established in Theorem 3.1
we get

$$\max_{g : \mathcal{S} \to \{1,\dots,K\}} \mathbb{E}\left[\sum_{t=1}^n \left(\ell_{I_t,t} - \ell_{g(s_t),t}\right)\right]
= \sum_{s \in \mathcal{S}} \max_{k=1,\dots,K} \mathbb{E}\left[\sum_{t \,:\, s_t = s} \left(\ell_{I_t,t} - \ell_{k,t}\right)\right]$$

$$\le \sum_{s \in \mathcal{S}} \sqrt{2 n_s K \ln K} \le \sqrt{2nSK \ln K}$$

where in the last step we used Jensen's inequality and the identity
$\sum_{s \in \mathcal{S}} n_s = n$. □

In Subsection 4.2.1 we extend this construction by considering several
context sets simultaneously.

A lower bound $\Omega\left(\sqrt{nSK}\right)$ is an immediate consequence of the
adversarial bandit lower bound (Theorem 3.4) under the assumption that
a constant fraction of the contexts in $\mathcal{S}$ marks at least a constant fraction
of the $n$ game rounds.
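The construction in the proof (one Exp3 instance per distinct context) can be sketched as follows. This is a minimal illustration under stated assumptions: the class names `Exp3` and `SExp3`, the fixed learning rate `eta`, and the toy problem in `run_demo` are all hypothetical choices made here, and the tuning of `eta` to $n_s$ used in the theorem is not attempted.

```python
import math
import random


class Exp3:
    """Exp3 without mixing for K arms and a fixed learning rate eta."""

    def __init__(self, K, eta, rng):
        self.K, self.eta, self.rng = K, eta, rng
        self.cum = [0.0] * K  # estimated cumulative losses

    def probs(self):
        low = min(self.cum)  # shift exponents for numerical stability
        w = [math.exp(-self.eta * (c - low)) for c in self.cum]
        total = sum(w)
        return [x / total for x in w]

    def draw(self):
        p = self.probs()
        arm = self.rng.choices(range(self.K), weights=p)[0]
        return arm, p[arm]

    def update(self, arm, p_arm, loss):
        self.cum[arm] += loss / p_arm  # importance-weighted loss estimate


class SExp3:
    """Keeps one independent Exp3 instance per distinct context."""

    def __init__(self, K, eta, seed=0):
        self.K, self.eta = K, eta
        self.rng = random.Random(seed)
        self.instances = {}

    def _instance(self, context):
        if context not in self.instances:
            self.instances[context] = Exp3(self.K, self.eta, self.rng)
        return self.instances[context]

    def draw(self, context):
        return self._instance(context).draw()

    def update(self, context, arm, p_arm, loss):
        self._instance(context).update(arm, p_arm, loss)


def run_demo(n=2000):
    # Hypothetical toy problem: arm 0 is lossless in context "a", arm 1 in "b".
    best = {"a": 0, "b": 1}
    forecaster = SExp3(K=2, eta=0.05)
    for t in range(n):
        context = "a" if t % 2 == 0 else "b"
        arm, p_arm = forecaster.draw(context)
        loss = 0.0 if arm == best[context] else 1.0
        forecaster.update(context, arm, p_arm, loss)
    return forecaster
```

Because the instances never share information, the regret simply adds up over contexts, which is where the factor $|\mathcal{S}|$ in the bound comes from.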



4.2 The expert case 

We now consider the contextual variant of the basic adversarial bandit 
model of Chapter 3. In this variant there is a finite set of $N$ randomized
policies. Following the setting of prediction with expert advice, no
assumptions are made on the way policies compute their randomized 
predictions, and the forecaster experiences the contexts only through 
the advice provided by the policies. For this reason, in what follows 
we use the word expert to denote a policy. Calling this a model of 
contextual bandits may sound a little strange, as the structure of con- 
texts does not seem to play a role here. However, we have decided to 
include this setting in this chapter because bandits with experts have
been used in practical contextual bandit problems (see, e.g., the news
recommendation experiment in Beygelzimer et al. [2011b]).



Formally, at each step $t = 1,2,\dots$ the forecaster obtains the expert
advice $(\xi_t^1,\dots,\xi_t^N)$, where each $\xi_t^j$ is a probability distribution
over arms representing the randomized play of expert $j$ at time $t$. If
$\ell_t = (\ell_{1,t},\dots,\ell_{K,t}) \in [0,1]^K$ is the vector of losses incurred by the $K$
arms at time $t$, then $\mathbb{E}_{i \sim \xi_t^j} \ell_{i,t}$ denotes the expected loss of expert $j$ at
time $t$. We allow the expert advice to depend on the realization of the
forecaster's past random plays. This fact is explicitly used in the proof
of Theorem 4.5.

Similarly to the pseudo-regret (3.1) for adversarial bandits, we now
introduce the pseudo-regret $\overline{R}_n^{\mathrm{ctx}}$ for the adversarial contextual bandit
problem,

$$\overline{R}_n^{\mathrm{ctx}} = \max_{i=1,\dots,N} \mathbb{E}\left[\sum_{t=1}^n \ell_{I_t,t} - \sum_{t=1}^n \mathbb{E}_{k \sim \xi_t^i} \ell_{k,t}\right].$$

In order to bound the contextual pseudo-regret $\overline{R}_n^{\mathrm{ctx}}$, one could
naively use the Exp3 strategy of Chapter 3 on the set of experts. This
would give a bound of order $\sqrt{nN \log N}$. In Figure 4.1 we introduce
the contextual forecaster Exp4 for which we show a bound of order
$\sqrt{nK \ln N}$. Thus, in this framework we can be competitive even with
an exponentially large (with respect to $n$) number of experts.

Exp4 is a simple adaptation of Exp3 to the contextual setting. Exp4
runs Exp3 over the $N$ experts using estimates of the experts' losses
$\mathbb{E}_{i \sim \xi_t^j} \ell_{i,t}$. In order to draw arms, Exp4 mixes the expert advice with
the probability distribution over experts maintained by Exp3. The
resulting bound on the pseudo-regret is of order $\sqrt{nK \ln N}$, where the
term $\sqrt{\ln N}$ comes from running Exp3 over the $N$ experts, while $\sqrt{K}$
is a bound on the second moment of the estimated expert losses under
the distribution $q_t$ computed by Exp3. Inequality (4.6) shows that
$\mathbb{E}_{j \sim q_t} \tilde y_{j,t}^2 \le \mathbb{E}_{i \sim p_t} \tilde\ell_{i,t}^2$. That is, this second moment is at most that of
the estimated arm losses under the distribution $p_t$ computed by Exp4,
which in turn is bounded by $\sqrt{K}$ using techniques from Chapter 3.



Theorem 4.2 (Pseudo-regret of Exp4). Exp4 without mixing and
with $\eta_t = \eta = \sqrt{\frac{2 \ln N}{nK}}$ satisfies

$$\overline{R}_n^{\mathrm{ctx}} \le \sqrt{2nK \ln N} . \tag{4.1}$$

On the other hand, with $\eta_t = \sqrt{\frac{\ln N}{tK}}$ it satisfies

$$\overline{R}_n^{\mathrm{ctx}} \le 2\sqrt{nK \ln N} . \tag{4.2}$$



Exp4 (Exponential weights algorithm for Exploration and Exploitation
with Experts) without mixing:

Parameter: a non-increasing sequence of real numbers $(\eta_t)_{t \in \mathbb{N}}$.
Let $q_1$ be the uniform distribution over $\{1,\dots,N\}$.
For each round $t = 1,2,\dots,n$

(1) Get expert advice $\xi_t^1,\dots,\xi_t^N$, where each $\xi_t^j$ is a
probability distribution over arms.

(2) Draw an arm $I_t$ from the probability distribution
$p_t = (p_{1,t},\dots,p_{K,t})$, where $p_{i,t} = \mathbb{E}_{j \sim q_t} \xi_{i,t}^j$.

(3) Compute the estimated loss for each arm

$\tilde\ell_{i,t} = \frac{\ell_{i,t}}{p_{i,t}} \mathbb{1}_{I_t = i}$,   $i = 1,\dots,K$.

(4) Compute the estimated loss for each expert

$\tilde y_{j,t} = \mathbb{E}_{i \sim \xi_t^j} \tilde\ell_{i,t}$,   $j = 1,\dots,N$.

(5) Update the estimated cumulative loss for each expert
$\tilde Y_{j,t} = \sum_{s=1}^t \tilde y_{j,s}$ for $j = 1,\dots,N$.

(6) Compute the new probability distribution over the experts
$q_{t+1} = (q_{1,t+1},\dots,q_{N,t+1})$, where

$q_{j,t+1} = \frac{\exp\left(-\eta_t \tilde Y_{j,t}\right)}{\sum_{k=1}^N \exp\left(-\eta_t \tilde Y_{k,t}\right)}$.



Fig. 4.1 Exp4 forecaster.
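The six steps of Figure 4.1 can be sketched as below. This is a minimal illustration, assuming a constant learning rate `eta` and 0-indexed arms and experts; the function name `exp4` and the callables `advice_fn` and `loss_fn` are hypothetical interfaces introduced here, not part of the original text.

```python
import math
import random


def exp4(advice_fn, loss_fn, N, K, n, eta, seed=0):
    """Exp4 without mixing, with a constant learning rate.

    advice_fn(t) -> list of N probability vectors over the K arms
    loss_fn(t, arm) -> loss in [0, 1] of the played arm
    Returns the total loss and the last expert distribution q used.
    """
    rng = random.Random(seed)
    Y = [0.0] * N  # estimated cumulative expert losses
    total = 0.0
    q = [1.0 / N] * N
    for t in range(1, n + 1):
        low = min(Y)  # shift exponents for numerical stability
        w = [math.exp(-eta * (y - low)) for y in Y]
        s = sum(w)
        q = [x / s for x in w]                       # distribution over experts
        xi = advice_fn(t)                            # step (1)
        p = [sum(q[j] * xi[j][i] for j in range(N))  # step (2)
             for i in range(K)]
        arm = rng.choices(range(K), weights=p)[0]
        loss = loss_fn(t, arm)
        total += loss
        estimate = loss / p[arm]                     # step (3): zero for unplayed arms
        for j in range(N):                           # steps (4) and (5)
            Y[j] += xi[j][arm] * estimate
    return total, q
```

Only the played arm has a nonzero estimated loss, so each expert's estimated loss in step (4) reduces to the mass it placed on the played arm times that single estimate.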



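Proof. We apply the analysis of Exp3 (Theorem 3.1) to a forecaster
using distributions $q_t$ over $N$ experts, whose pseudo-losses are $\tilde y_{j,t}$ for
$j = 1,\dots,N$. This immediately gives the inequality

$$\sum_{t=1}^n \mathbb{E}_{j \sim q_t} \tilde y_{j,t} \le \tilde Y_{k,n} + \frac{\ln N}{\eta_n} + \frac{1}{2} \sum_{t=1}^n \eta_t \, \mathbb{E}_{j \sim q_t} \tilde y_{j,t}^2 . \tag{4.3}$$

Now, similarly to (3.5) in the proof of Theorem 3.1, we establish the
following inequalities

$$\mathbb{E}_{I_t \sim p_t} \tilde y_{k,t} = \mathbb{E}_{I_t \sim p_t} \mathbb{E}_{i \sim \xi_t^k} \tilde\ell_{i,t} = \mathbb{E}_{i \sim \xi_t^k} \ell_{i,t} = y_{k,t} \tag{4.4}$$

$$\mathbb{E}_{j \sim q_t} \tilde y_{j,t} = \mathbb{E}_{j \sim q_t} \mathbb{E}_{i \sim \xi_t^j} \tilde\ell_{i,t} = \mathbb{E}_{i \sim p_t} \tilde\ell_{i,t} = \ell_{I_t,t} \tag{4.5}$$

$$\mathbb{E}_{j \sim q_t} \tilde y_{j,t}^2 \le \mathbb{E}_{j \sim q_t} \mathbb{E}_{i \sim \xi_t^j} \tilde\ell_{i,t}^2 = \mathbb{E}_{i \sim p_t} \tilde\ell_{i,t}^2 = \frac{\ell_{I_t,t}^2}{p_{I_t,t}} \tag{4.6}$$

where we used Jensen's inequality to prove (4.6). By applying (4.5)
and (4.6) to (4.3) we get

$$\sum_{t=1}^n \ell_{I_t,t} = \sum_{t=1}^n \mathbb{E}_{j \sim q_t} \tilde y_{j,t} \le \tilde Y_{k,n} + \frac{\ln N}{\eta_n} + \frac{1}{2} \sum_{t=1}^n \eta_t \frac{\ell_{I_t,t}^2}{p_{I_t,t}} .$$

Now note that, if we take expectation over the draw of $I_1,\dots,I_n$,
using (4.4) we obtain

$$\mathbb{E}\, \tilde Y_{k,n} = \mathbb{E}\left[\sum_{t=1}^n \mathbb{E}\left[\tilde y_{k,t} \mid I_1,\dots,I_{t-1}\right]\right] = \mathbb{E}\left[\sum_{t=1}^n y_{k,t}\right] .$$

Hence,

$$\overline{R}_n^{\mathrm{ctx}} = \max_{k=1,\dots,N} \mathbb{E}\left[\sum_{t=1}^n \ell_{I_t,t} - \sum_{t=1}^n y_{k,t}\right] \le \frac{\ln N}{\eta_n} + \frac{K}{2} \sum_{t=1}^n \eta_t .$$

Choosing $\eta_t$ as in the statement of the Theorem, and using the inequality
$\sum_{t=1}^n \frac{1}{\sqrt{t}} \le 2\sqrt{n}$, concludes the proof. □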

Besides pseudo-regret, the contextual regret

$$R_n^{\mathrm{ctx}} = \max_{i=1,\dots,N} \left(\sum_{t=1}^n \ell_{I_t,t} - \sum_{t=1}^n \mathbb{E}_{k \sim \xi_t^i} \ell_{k,t}\right)$$

can be also bounded, at least with high probability. Indeed, similarly to
the variant Exp3.P of Exp3 (see Section 3.2), an analogous modification
of Exp4, called Exp4.P, satisfies

$$R_n^{\mathrm{ctx}} \le c \sqrt{nK \ln\left(N \delta^{-1}\right)}$$

for some constant $c > 0$ and with probability at least $1 - \delta$, where
$\delta \in (0,1)$ is a parameter of the algorithm.

4.2.1 Competing against the best context set

We revisit the basic contextual scenario introduced in Section 4.1,
where the goal is to compete against the best mapping from contexts
to arms. Consider now a class $\{\mathcal{S}_\theta : \theta \in \Theta\}$ of context sets. In this new
game, each time step $t = 1,2,\dots$ is marked by the vector $(s_{\theta,t})_{\theta \in \Theta}$ of
contexts, one for each set in $\Theta$. Introduce the pseudoregret

$$\overline{R}_n^{\Theta} = \max_{\theta \in \Theta} \, \max_{g : \mathcal{S}_\theta \to \{1,\dots,K\}} \mathbb{E}\left[\sum_{t=1}^n \ell_{I_t,t} - \sum_{t=1}^n \ell_{g(s_{\theta,t}),t}\right] .$$

When $|\Theta| = 1$ we recover the contextual pseudoregret $\overline{R}_n^{\mathcal{S}}$. In general,
when $\Theta$ contains more than one set, the forecaster must learn both the
best set $\mathcal{S}_\theta$ and the best function $g : \mathcal{S}_\theta \to \{1,\dots,K\}$ from that set to
the set of arms.

We find this variant of contextual bandits interesting because its
solution involves a nontrivial combination of two of the main algorithms
examined in this chapter: Exp4 and $\mathcal{S}$-Exp3. In particular, we consider
a scenario in which Exp4 uses instances of $\mathcal{S}$-Exp3 as experts. The
interesting aspect is that these experts are learning themselves, and thus
the analysis of the combined algorithm requires taking into account the
learning process at both levels.

Note that in order to solve this problem we could simply lump all
contexts in a big set and use the proof of Theorem 4.1. However, this
would give a regret bound that depends exponentially on $|\Theta|$. On the
other hand, by using Exp4 directly on the set of all policies $g$ (which is
of cardinality exponential in $|\Theta| \times |\mathcal{S}|$), we could improve this to a bound
that scales with $\sqrt{|\Theta|}$. The idea we explore here is to use Exp4 over
the class $\Theta$ of "experts", and combine this with the $\mathcal{S}$-Exp3 algorithm
of Theorem 4.1. This gets us down to a logarithmic dependency on $|\Theta|$,
albeit at the price of a worse dependency on $n$.

Intuitively, Exp4 provides competitiveness against the best context
set $\mathcal{S}_\theta$, while the instances of the $\mathcal{S}$-Exp3 algorithm, acting as experts
for Exp4, ensure that we are competitive against the best function
$g : \mathcal{S}_\theta \to \{1,\dots,K\}$ for each $\theta \in \Theta$. However, by doing so we immediately
run into a problem: the $p_t$ used by Exp4 is not the same as the $p_t^\theta$'s
used by each expert. In order to address this issue, we now show that
the analysis of Exp3 holds even when the sequence of plays $I_1, I_2, \dots$
is drawn from a sequence of distributions $q_1, q_2, \dots$ possibly different
from the one chosen by the forecaster. The only requirement we need
is that each probability in $q_t$ be bounded away from zero.

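Theorem 4.3. Consider a $K$-armed bandit game in which at each
step $t = 1,2,\dots$ the played arm $I_t$ is drawn from an arbitrary
distribution $q_t$ over arms. Each $q_t$ may depend in an arbitrary way on
the pairs $(I_1,\ell_{I_1,1}),\dots,(I_{t-1},\ell_{I_{t-1},t-1})$. Moreover, $q_{i,t} \ge \varepsilon > 0$ for all
$i = 1,\dots,K$ and $t \ge 1$.

If Exp3 without mixing is run with $\tilde\ell_{i,t} = \frac{\ell_{i,t}}{q_{i,t}} \mathbb{1}_{I_t = i}$ and
$\eta_t = \eta = \sqrt{\frac{2\varepsilon \ln K}{n}}$, then

$$\max_{k=1,\dots,K} \mathbb{E}_{I^n \sim q^n}\left[\sum_{t=1}^n \mathbb{E}_{i \sim p_t} \ell_{i,t} - \sum_{t=1}^n \ell_{k,t}\right] \le \sqrt{\frac{2n \ln K}{\varepsilon}} \tag{4.7}$$

where $I^n \sim q^n$ means that each $I_t$ is drawn from $q_t$ for $t = 1,\dots,n$,
and $p_t$ is the distribution used by Exp3 at time $t$.

Proof. The proof is an easy adaptation of the Exp3 analysis (Theorem 3.1
in Section 3) and we just highlight the differences. The key step is the
analysis of the log-moment of $\tilde\ell_{i,t}$:

$$\mathbb{E}_{i \sim p_t} \tilde\ell_{i,t} = \frac{1}{\eta} \ln \mathbb{E}_{i \sim p_t} \exp\left(-\eta\left(\tilde\ell_{i,t} - \mathbb{E}_{k \sim p_t} \tilde\ell_{k,t}\right)\right) - \frac{1}{\eta} \ln \mathbb{E}_{i \sim p_t} \exp\left(-\eta \tilde\ell_{i,t}\right) .$$

The first term is bounded in a manner slightly different from the proof
of Theorem 3.1,

$$\frac{1}{\eta} \ln \mathbb{E}_{i \sim p_t} \exp\left(-\eta\left(\tilde\ell_{i,t} - \mathbb{E}_{k \sim p_t} \tilde\ell_{k,t}\right)\right) \le \frac{\eta}{2} \mathbb{E}_{i \sim p_t} \tilde\ell_{i,t}^2 \le \frac{\eta}{2} \frac{p_{I_t,t}}{q_{I_t,t}^2} .$$

The analysis of the second term is unchanged: Let $\tilde L_{i,0} = 0$, $\Phi_0(\eta) = 0$
and $\Phi_t(\eta) = \frac{1}{\eta} \ln \frac{1}{K} \sum_{i=1}^K \exp\left(-\eta \tilde L_{i,t}\right)$. Then by definition of $p_t$ we
have:

$$-\frac{1}{\eta} \ln \mathbb{E}_{i \sim p_t} \exp\left(-\eta \tilde\ell_{i,t}\right) = \Phi_{t-1}(\eta) - \Phi_t(\eta) .$$

Proceeding again as in the proof of Theorem 3.1 we obtain

$$\mathbb{E}\left[\sum_{t=1}^n \mathbb{E}_{i \sim p_t} \tilde\ell_{i,t}\right] \le \mathbb{E}\left[\tilde L_{k,n}\right] + \mathbb{E}\left[\sum_{t=1}^n \frac{\eta}{2} \frac{p_{I_t,t}}{q_{I_t,t}^2}\right] + \frac{\ln K}{\eta} .$$

Now observe that

$$\mathbb{E}_{I_t \sim q_t} \mathbb{E}_{i \sim p_t} \tilde\ell_{i,t} = \mathbb{E}_{i \sim p_t} \ell_{i,t} , \qquad \mathbb{E}_{I_t \sim q_t} \tilde\ell_{k,t} = \ell_{k,t}$$

and

$$\mathbb{E}_{I_t \sim q_t} \frac{p_{I_t,t}}{2 q_{I_t,t}^2} = \frac{1}{2} \sum_{i=1}^K \frac{p_{i,t}}{q_{i,t}} \le \frac{1}{2\varepsilon} .$$

Therefore

$$\mathbb{E}_{I^n \sim q^n}\left[\sum_{t=1}^n \mathbb{E}_{i \sim p_t} \ell_{i,t} - \sum_{t=1}^n \ell_{k,t}\right] \le \frac{\eta n}{2\varepsilon} + \frac{\ln K}{\eta} .$$

Choosing $\eta$ as in the statement of the theorem concludes the proof. □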

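It is left to the reader to verify that the analysis of $\mathcal{S}$-Exp3 in
Theorem 4.1 can be combined with the above analysis to give the bound

$$\max_{g : \mathcal{S} \to \{1,\dots,K\}} \mathbb{E}_{I^n \sim q^n}\left[\sum_{t=1}^n \mathbb{E}_{i \sim p_t} \ell_{i,t} - \sum_{t=1}^n \ell_{g(s_t),t}\right] \le \sqrt{\frac{2n}{\varepsilon} |\mathcal{S}| \ln K} . \tag{4.8}$$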

Next, we state a bound on the contextual pseudoregret of a variant
of Exp4 whose probabilities $p_{i,t}$ satisfy the property $p_{i,t} \ge \frac{\gamma}{K}$ for all
$i = 1,\dots,K$ and $t \ge 1$, where $\gamma > 0$ is a parameter. This is obtained by
replacing in Exp4 the assignment $p_{i,t} = \mathbb{E}_{j \sim q_t} \xi_{i,t}^j$ (line 2 in Figure 4.1)
with the assignment

$$p_{i,t} = (1 - \gamma)\, \mathbb{E}_{j \sim q_t} \xi_{i,t}^j + \frac{\gamma}{K}$$

where $\gamma > 0$ is the mixing coefficient. This mixing clearly achieves the
desired property for each $p_{i,t}$.
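As a small illustration of the mixing step, the sketch below applies the assignment to a given arm distribution. The helper name `mix_with_uniform` is hypothetical; it takes the unmixed probabilities as a plain list.

```python
def mix_with_uniform(p, gamma):
    """Return (1 - gamma) * p + gamma / K, a distribution with every entry >= gamma / K."""
    K = len(p)
    return [(1 - gamma) * p_i + gamma / K for p_i in p]
```

Even when the unmixed distribution puts all its mass on one arm, every arm keeps probability at least $\gamma/K$, which is the lower bound needed to apply Theorem 4.3 with $\varepsilon = \gamma/K$.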



Theorem 4.4 (Pseudo-regret of Exp4 with mixing). Exp4 with
mixing coefficient $\gamma$ and with $\eta_t = \eta = \gamma/K$ satisfies

$$\overline{R}_n^{\mathrm{ctx}} \le \frac{\gamma n}{2} + \frac{K \ln N}{\gamma} . \tag{4.9}$$

Proof. The proof goes along the same lines as the original Exp4
proof [Auer et al., 2002b, Theorem 7.1] with the following modifications:
since the weights are negative exponentials, we can use the bound
$\exp(-x) \le 1 - x + \frac{x^2}{2}$ for all $x \ge 0$ rather than $\exp(x) \le 1 + x + (e - 2)x^2$
for all $0 \le x \le 1$; the term $(1 - \gamma) \sum_t \tilde y_{k,t}$ is upper bounded directly by
$\sum_t \tilde y_{k,t}$; the term $\frac{\gamma}{K} \sum_t \sum_{i=1}^K \tilde\ell_{i,t}$ is upper bounded by $\gamma n$ without
requiring the assumption that the expert set contains the "uniform expert".
Finally, the fact that experts' distributions $\xi_t^j$ depend on the realization
of past forecaster's random arms is dealt with in the same way as in
the proof of Theorem 4.2. □



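Theorem 4.5. There exists a randomized forecaster achieving

$$\overline{R}_n^{\Theta} = O\left(n^{2/3} \left(\max_{\theta \in \Theta} |\mathcal{S}_\theta| K \ln K\right)^{1/3} \sqrt{\ln |\Theta|}\right)$$

for any class $\{\mathcal{S}_\theta : \theta \in \Theta\}$ of context sets.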



Proof. We run the Exp4 forecaster with mixing coefficient $\gamma$ using
instances of the $\mathcal{S}$-Exp3 algorithm (defined in the proof of Theorem 4.1)
as experts. Each $\mathcal{S}$-Exp3 instance is run on a different context set $\mathcal{S}_\theta$
for $\theta \in \Theta$. Let $\xi_t^\theta$ be the distribution used at time $t$ by the $\mathcal{S}$-Exp3
instance running on context set $\mathcal{S}_\theta$ and let $p^n$ be the joint distribution
of $I^n = (I_1,\dots,I_n)$ used by Exp4. Since $p_{i,t} \ge \frac{\gamma}{K}$ for all $i = 1,\dots,K$
and $t \ge 1$, we can use (4.8) with $\varepsilon = \gamma/K$. Thus, Theorem 4.4 implies

$$\mathbb{E}\left[\sum_{t=1}^n \ell_{I_t,t}\right] \le \min_{\theta \in \Theta} \mathbb{E}_{I^n \sim p^n}\left[\sum_{t=1}^n \mathbb{E}_{i \sim \xi_t^\theta} \ell_{i,t}\right] + \frac{\gamma n}{2} + \frac{K \ln |\Theta|}{\gamma}$$

$$\le \min_{\theta \in \Theta} \, \min_{g : \mathcal{S}_\theta \to \{1,\dots,K\}} \mathbb{E}\left[\sum_{t=1}^n \ell_{g(s_{\theta,t}),t}\right] + \sqrt{\frac{2n}{\varepsilon} \max_{\theta \in \Theta} |\mathcal{S}_\theta| \ln K} + \frac{\gamma n}{2} + \frac{K \ln |\Theta|}{\gamma} .$$

Substituting $\varepsilon = \gamma/K$ in the above expression and choosing $\gamma$ of the
order of $n^{-1/3} \left(\max_{\theta \in \Theta} |\mathcal{S}_\theta| K \ln K\right)^{1/3} \sqrt{\ln |\Theta|}$ gives the desired result. □

Note that in Theorem 4.5 the rate is $n^{2/3}$, in contrast to the more
usual $n^{1/2}$ bandit rate. This worsening is inherent in the
Exp4-over-Exp3 construction. It is not known whether the rate could be improved
while keeping the same logarithmic dependence on $|\Theta|$ guaranteed by
this construction.
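As a rough check of where the $n^{2/3}$ rate comes from, one can balance the two dominant terms of the bound in the proof. The sketch below ignores constants and the $\sqrt{\ln|\Theta|}$ factor, and uses $M$ as a shorthand (introduced here, not in the original) for $\max_{\theta \in \Theta} |\mathcal{S}_\theta|$.

```latex
% With \varepsilon = \gamma / K the bound reads, up to constants,
\sqrt{\frac{nKM \ln K}{\gamma}} + \gamma n + \frac{K \ln|\Theta|}{\gamma},
\qquad M = \max_{\theta \in \Theta} |\mathcal{S}_\theta| .
% Equating the first two terms:
\sqrt{\frac{nKM \ln K}{\gamma}} = \gamma n
\;\Longleftrightarrow\;
\gamma^3 = \frac{KM \ln K}{n}
\;\Longleftrightarrow\;
\gamma = n^{-1/3} (MK \ln K)^{1/3},
% at which point both terms equal
\gamma n = n^{2/3} (MK \ln K)^{1/3} .
```

The exploration floor $\gamma/K$ therefore plays a double role: it must be large enough to keep the experts' importance weights under control, yet every unit of it costs linear regret, and this trade-off is what produces the $n^{2/3}$ exponent.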

4.3 Stochastic contextual bandits 

We now move on to consider the case in which policies have a known 
structure. More specifically, each policy is a function $f$ mapping the
context space to the arm space $\{1,\dots,K\}$ and the set $\mathcal{F}$ of policies is
given as an input parameter to the forecaster.

Under this assumption on the policies, the problem can be viewed 
as a bandit variant of supervised learning. For this reason, here and in 
the next section we follow the standard notation of supervised learning 
and use x rather than s to denote contexts. 

In supervised learning, we observe data of the form $(x_t, \ell_t)$. In the
contextual bandit setting, the observed data are $(x_t, \ell_{I_t,t})$ where $I_t$ is
the arm chosen by the forecaster at time $t$ given context $x_t \in \mathcal{X}$. This
connection to supervised learning has steered the focus of research
towards stochastic data generation models, which are widespread in the
analysis of supervised learning. In the stochastic variant of contextual
bandits, contexts $x_t$ and arm losses $\ell_t = (\ell_{1,t},\dots,\ell_{K,t})$ are realizations
of i.i.d. draws from a fixed and unknown distribution $D$ over $\mathcal{X} \times [0,1]^K$.
In tight analogy with statistical learning theory, a policy $f$ is evaluated
in terms of its statistical risk $\ell_D(f) = \mathbb{E}_{(x,\ell) \sim D}\, \ell_{f(x)}$. Let

$$f^* = \operatorname*{arg\,inf}_{f \in \mathcal{F}} \ell_D(f)$$

be the risk-minimizing policy in the class. The regret with respect to the
class $\mathcal{F}$ of a forecaster choosing arms $I_1, I_2, \dots$ is then defined by

$$\sum_{t=1}^n \ell_{I_t,t} - n\, \ell_D(f^*) .$$

This can be viewed as the stochastic counterpart of the adversarial
contextual regret $R_n^{\mathrm{ctx}}$ introduced in Section 4.2. The main question is
now to characterize the "price of bandit information" using the sample
complexity of supervised learning as yardstick.

In the rest of this section we focus on the case of $K = 2$ arms and
parametrize classes $\mathcal{F}$ of policies $f : \mathcal{X} \to \{1, 2\}$ by their VC-dimension
$d$ (see Boucheron et al. [2005] for a modern introduction to VC theory).
For this setting, we consider the following forecaster.



VE (VC dimension by Exponentiation):

Parameters: number $n$ of rounds, $n'$ satisfying $1 \le n' \le n$.

(1) For the first $n'$ rounds, choose arms uniformly at random.

(2) Build $\mathcal{F}' \subseteq \mathcal{F}$ such that for any $f \in \mathcal{F}$ there is exactly
one $f' \in \mathcal{F}'$ satisfying $f(x_t) = f'(x_t)$ for all $t = 1,\dots,n'$.

(3) For $t = n' + 1,\dots,n$ play by simulating Exp4.P using the
policies of $\mathcal{F}'$ as experts.
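Step (2) of VE keeps one representative policy per distinct behaviour on the first $n'$ contexts. A minimal sketch for a finite list of candidate policies is given below; the function name `distinct_policies` is hypothetical, and enumerating an infinite class is outside the scope of this illustration.

```python
def distinct_policies(policies, contexts):
    """Keep one representative policy per labelling of the given contexts."""
    representatives = {}
    for f in policies:
        signature = tuple(f(x) for x in contexts)  # behaviour on x_1, ..., x_{n'}
        representatives.setdefault(signature, f)   # the first policy seen is kept
    return list(representatives.values())
```

For a class of VC-dimension $d$, the Sauer-Shelah lemma bounds the number of distinct signatures, which is what keeps the number of experts passed to Exp4.P polynomial in the number of contexts.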



We now show that the per round regret of VE is of order $\sqrt{d/n}$,
excluding logarithmic factors. This rate is equal to the optimal rate for
supervised learning of VC-classes, showing that — in this case — the 
price of bandit information is essentially zero. 

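Theorem 4.6. For any class $\mathcal{F}$ of binary policies $f : \mathcal{X} \to \{0, 1\}$ of
VC-dimension $d$ and for all $n > d$, the forecaster VE run with
$n' = \sqrt{n \left(2d \ln \frac{en}{d} + \ln \frac{3}{\delta}\right)}$ satisfies

$$\sum_{t=1}^n \ell_{I_t,t} - n \inf_{f \in \mathcal{F}} \ell_D(f) \le c \sqrt{n \left(d \ln \frac{en}{d} + \ln \frac{3}{\delta}\right)} \tag{4.10}$$

for some constant $c > 0$ and with probability at least $1 - \delta$ with respect
to both the random data generation and VE's internal randomization.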

Proof. Given a sample realization $(x_1, \ell_1),\dots,(x_n, \ell_n)$, let $f'$ be the unique
element of $\mathcal{F}'$ such that $f'(x_t) = f^*(x_t)$ for all $t = 1,\dots,n'$, where
$f^*$ is the risk-minimizing function in $\mathcal{F}$. Given a sample, we may
assume without loss of generality that $\mathcal{F}$ contains functions restricted
on the finite domain $\{x_1,\dots,x_n\}$. Recall the Sauer-Shelah lemma (see,
e.g., Boucheron et al. [2005]), stating that any class $\mathcal{F}$ of binary
functions defined on a finite domain of size $n$ satisfies $|\mathcal{F}| \le \left(\frac{en}{d}\right)^d$, where
$d$ is the VC-dimension of $\mathcal{F}$. Then, with probability at least $1 - \frac{\delta}{3}$ with
respect to VE's internal randomization,

$$\sum_{t=1}^n \ell_{I_t,t} \le n' + \sum_{t=n'+1}^n \ell_{f'(x_t),t} + c \sqrt{2(n - n') \ln \frac{3|\mathcal{F}'|}{\delta}}$$

$$\le n' + \sum_{t=n'+1}^n \left(\ell_{f^*(x_t),t} + \ell_{f'(x_t),t} - \ell_{f^*(x_t),t}\right) + c \sqrt{2(n - n') \ln \frac{3|\mathcal{F}'|}{\delta}}$$

$$\le n' + \sum_{t=n'+1}^n \left(\ell_{f^*(x_t),t} + \mathbb{1}_{f'(x_t) \ne f^*(x_t)}\right) + c \sqrt{2(n - n') \ln \frac{3|\mathcal{F}'|}{\delta}}$$

$$\le n' + \sum_{t=n'+1}^n \left(\ell_{f^*(x_t),t} + \mathbb{1}_{f'(x_t) \ne f^*(x_t)}\right) + c \sqrt{2n \left(d \ln \frac{en}{d} + \ln \frac{3}{\delta}\right)}$$

where we used $\ell_{i,t} \in [0, 1]$ in the penultimate step and the Sauer-Shelah
lemma in the last step. Now, the term $\sum_t \ell_{f^*(x_t),t}$ is controlled in
probability w.r.t. the random draw of the sample via Chernoff bounds,

$$\mathbb{P}\left(\sum_{t=n'+1}^n \ell_{f^*(x_t),t} > (n - n')\, \ell_D(f^*) + \sqrt{\frac{n - n'}{2} \ln \frac{3}{\delta}}\right) \le \frac{\delta}{3} .$$

Hence,

$$\sum_{t=1}^n \ell_{I_t,t} \le n' + n\, \ell_D(f^*) + \sum_{t=n'+1}^n \mathbb{1}_{f'(x_t) \ne f^*(x_t)} + c \sqrt{2n \left(d \ln \frac{en}{d} + \ln \frac{3}{\delta}\right)} + \sqrt{\frac{n}{2} \ln \frac{3}{\delta}}$$

with probability at least $1 - \frac{2\delta}{3}$ with respect to both the random sample
draw and VE's internal randomization.

The term $\sum_t \mathbb{1}_{f'(x_t) \ne f^*(x_t)}$ quantifies the fact that the unique function
$f' \in \mathcal{F}'$ that agrees with $f^*$ on the first $n'$ data points is generally
different from $f^*$ on the remaining $n - n'$ points. Since each data point
$(x_t, \ell_t)$ is drawn i.i.d., the distribution of a sequence of $n$ pairs remains
the same if we randomly permute their positions after drawing them.
Hence we can bound $\sum_t \mathbb{1}_{f'(x_t) \ne f^*(x_t)}$ in probability w.r.t. a random
permutation $\sigma$ of $\{1,\dots,n\}$. Let $\|f - g\| = \sum_{t=1}^n \mathbb{1}_{f(x_t) \ne g(x_t)}$. Then

$$\mathbb{P}_\sigma\left(\sum_{t=n'+1}^n \mathbb{1}_{f'(x_{\sigma(t)}) \ne f^*(x_{\sigma(t)})} > k\right)$$

$$\le \mathbb{P}_\sigma\left(\exists f, g \in \mathcal{F},\ \|f - g\| > k \,:\, f(x_{\sigma(t)}) = g(x_{\sigma(t)}),\ t = 1,\dots,n'\right)$$

$$\le |\mathcal{F}|^2 \left(1 - \frac{k}{n}\right)^{n'} \le \left(\frac{en}{d}\right)^{2d} \exp\left(-\frac{k n'}{n}\right) \le \frac{\delta}{3}$$

for

$$k \ge \frac{n}{n'} \left(2d \ln \frac{en}{d} + \ln \frac{3}{\delta}\right) .$$

Now, since we just proved that

$$\sum_{t=n'+1}^n \mathbb{1}_{f'(x_{\sigma(t)}) \ne f^*(x_{\sigma(t)})} \le \frac{n}{n'} \left(2d \ln \frac{en}{d} + \ln \frac{3}{\delta}\right)$$

holds with probability at least $1 - \frac{\delta}{3}$ for any sample realization, it holds
with the same probability for a random sample. Hence, by choosing $n'$
as in the statement of the theorem and overapproximating, we get the
desired result. □



4.4 The multiclass case 

A different viewpoint on contextual bandits is provided by the so-called
bandit multiclass problem. This is a bandit variant of the online protocol
for multiclass classification, where the goal is to sequentially learn a
mapping from the context space $\mathbb{R}^d$ to the label space $\{1,\dots,K\}$, with
$K \ge 2$. In this protocol the learner keeps a classifier parameterized by
a $K \times d$ matrix $W$. At each time step $t = 1,2,\dots$ the side information
$x_t \in \mathbb{R}^d$ is observed (following standard notations in online classification,
here we use $x$ instead of $s$ to denote contexts), and the learner
predicts the label $\hat y_t$ maximizing $(W x_t)_i$ over all labels $i = 1,\dots,K$.
In the standard online protocol, the learner observes the true label $y_t$
associated with $x_t$ after each prediction, and uses this information to
adjust $W$. In the bandit version, the learner only observes $\mathbb{1}_{\hat y_t \ne y_t}$; that
is, whether the prediction at time $t$ was correct or not.

A simple but effective learning strategy for (non-bandit) online
classification is the multiclass Perceptron algorithm. This algorithm
updates $W$ at time $t$ using the rule $W \leftarrow W + X_t$, where $X_t$ is a $K \times d$
matrix with components $(X_t)_{i,j} = x_{t,j} \left(\mathbb{1}_{y_t = i} - \mathbb{1}_{\hat y_t = i}\right)$. Therefore, the
update rule can be rewritten as

$w_{y_t} \leftarrow w_{y_t} + x_t$

$w_{\hat y_t} \leftarrow w_{\hat y_t} - x_t$

$w_i \leftarrow w_i$ for all $i \ne y_t$ and $i \ne \hat y_t$

where $w_i$ denotes the $i$-th row of matrix $W$. Note, in particular, that
no update takes place (i.e., $X_t$ is the all zero matrix) when $\hat y_t = y_t$,
which means that $y_t$ is predicted correctly.
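The row-wise form of the update translates directly into code. The sketch below is an illustration with 0-indexed labels and plain Python lists; the function name `perceptron_step` is hypothetical, and ties in the argmax are broken in favour of the smallest label, which is an assumption made here.

```python
def perceptron_step(W, x, y):
    """One round of the multiclass Perceptron; W (a list of K rows) is updated in place.

    Returns the predicted label for context x before the update.
    """
    scores = [sum(w_j * x_j for w_j, x_j in zip(row, x)) for row in W]
    y_hat = max(range(len(W)), key=scores.__getitem__)   # predicted label
    if y_hat != y:                                       # no update when correct
        W[y] = [w + v for w, v in zip(W[y], x)]          # promote the true label
        W[y_hat] = [w - v for w, v in zip(W[y_hat], x)]  # demote the wrong prediction
    return y_hat
```

Note that the update needs the true label $y_t$ on every mistake, which is exactly the information that is unavailable in the bandit version.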

A straightforward generalization of the Perceptron analysis gives
that, on any sequence $(x_1, y_1), (x_2, y_2), \dots \in \mathbb{R}^d \times \{1,\dots,K\}$ such
that $\|x_t\| = 1$, the number of classification mistakes satisfies the
following notion of regret,

$$\sum_{t=1}^n \mathbb{1}_{\hat y_t \ne y_t} \le \inf_U \left(L_n(U) + 2\|U\|^2 + \|U\| \sqrt{2n \overline{L}_n(U)}\right)$$

uniformly over $n \ge 1$, where the infimum is over all $K \times d$ matrices $U$
and $\|\cdot\|$ denotes the Frobenius norm. Here $L_n(U)$ denotes the
cumulative hinge loss of policy $U$,

$$L_n(U) = \sum_{t=1}^n \ell_t(U) = \sum_{t=1}^n \left[1 - (U x_t)_{y_t} + \max_{i \ne y_t} (U x_t)_i\right]_+$$

where $[\cdot]_+ = \max\{0, \cdot\}$ is the hinge function. Finally, $\overline{L}_n(U) =
\frac{1}{n} L_n(U)$ is the average hinge loss of $U$.

Note that $\ell_t(U) = 0$ if and only if $(U x_t)_{y_t} \ge 1 + \max_{i \ne y_t} (U x_t)_i$,
which can only happen when $y_t = \hat y_t = \arg\max_i (U x_t)_i$. Moreover,
$\ell_t(U) \ge 1$ if and only if $\hat y_t \ne y_t$. This means that $\ell_t$ is a convex upper
bound on the mistake indicator function $\mathbb{1}_{\hat y_t \ne y_t}$ for the multiclass
classifier represented by $U$.
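For concreteness, the per-round hinge loss can be computed as in the following sketch (0-indexed labels, at least two classes; the function name `multiclass_hinge` is hypothetical).

```python
def multiclass_hinge(U, x, y):
    """Hinge loss [1 - (Ux)_y + max_{i != y} (Ux)_i]_+ of the linear classifier U."""
    scores = [sum(u_j * x_j for u_j, x_j in zip(row, x)) for row in U]
    rival = max(score for i, score in enumerate(scores) if i != y)
    return max(0.0, 1.0 - scores[y] + rival)
```

The loss vanishes only when the true label wins by a margin of at least one, and it is at least one whenever some other label scores as high as the true label.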

We now introduce a bandit variant of the multiclass Perceptron 
called Banditron. 



Banditron

Parameter: number $\gamma \in (0, 1/2)$.

Initialize: Set $W_1$ to the zero $K \times d$ matrix.

For each round $t = 1, 2, \dots, n$

(1) Observe $x_t \in \mathbb{R}^d$.

(2) Set $\hat y_t = \arg\max_{i=1,\dots,K} (W_t x_t)_i$.

(3) Predict $Y_t \in \{1,\dots,K\}$ drawn from distribution
$p_t = (p_{1,t},\dots,p_{K,t})$ such that $p_{i,t} = (1 - \gamma) \mathbb{1}_{\hat y_t = i} + \frac{\gamma}{K}$.

(4) Observe $\mathbb{1}_{Y_t = y_t}$.

(5) Update $W_{t+1} = W_t + \tilde X_t$ where

$(\tilde X_t)_{i,j} = x_{t,j} \left(\frac{\mathbb{1}_{Y_t = y_t} \mathbb{1}_{Y_t = i}}{p_{i,t}} - \mathbb{1}_{\hat y_t = i}\right)$.



The Banditron operates in the bandit variant of the online protocol
for multiclass classification. As $X_t$ depends on the true label $y_t$,
which is only available when the classification is correct, the Banditron
computes an estimate $\tilde X_t$ of $X_t$ via a randomized technique
inspired by the nonstochastic multiarmed bandit problem. The label
$\hat y_t = \arg\max_i (W_t x_t)_i$ is used to make the prediction at time $t$ only
with probability $1 - \gamma$, whereas with probability $\gamma$ a random label is
predicted at each time $t$.
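The Banditron box can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: labels are 0-indexed, ties in the argmax go to the smallest label, and the function name `banditron` and its return values are hypothetical. The true label is used only to compute the one-bit feedback.

```python
import random


def banditron(examples, K, gamma, seed=0):
    """Run the Banditron on a list of (x, y) pairs with y in range(K).

    Returns the number of prediction mistakes and the final weight matrix.
    """
    rng = random.Random(seed)
    d = len(examples[0][0])
    W = [[0.0] * d for _ in range(K)]
    mistakes = 0
    for x, y in examples:
        scores = [sum(w * v for w, v in zip(row, x)) for row in W]
        y_hat = max(range(K), key=scores.__getitem__)
        p = [(1 - gamma) * (i == y_hat) + gamma / K for i in range(K)]
        Y = rng.choices(range(K), weights=p)[0]  # randomized prediction
        correct = (Y == y)                       # the only feedback observed
        mistakes += not correct
        for i in range(K):
            # Row i of the estimated update matrix is coef * x.
            coef = (correct and Y == i) / p[i] - (i == y_hat)
            if coef:
                W[i] = [w + coef * v for w, v in zip(W[i], x)]
    return mistakes, W
```

On a small separable toy stream the number of mistakes stays well below the number of rounds, with most of the residual mistakes caused by the exploration rounds.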

We now analyze the expected number of prediction mistakes made
by the Banditron algorithm on any sequence of examples $(x_t, y_t)$. Unlike
the non-bandit case, where the number of mistakes $M_n$ after $n$ steps of
the multiclass Perceptron provides a "multiclass regret" bound
$M_n - L_n(U) = O\left(\sqrt{n}\right)$, in the bandit case the regret achieved by the variant
of the Perceptron is only bounded by $O\left(n^{2/3}\right)$.

Theorem 4.7. If the Banditron algorithm is run with parameter
$\gamma = (K/n)^{1/3}$ on any sequence $(x_1, y_1),\dots,(x_n, y_n) \in \mathbb{R}^d \times \{1,\dots,K\}$ of
examples such that $n \ge 8K$ and $\|x_t\| = 1$, then the number $M_n$ of
prediction mistakes satisfies

$$\mathbb{E} M_n \le \inf_U \left[L_n(U) + \left(1 + \|U\| \sqrt{2 \overline{L}_n(U)}\right) K^{1/3} n^{2/3} + 2\|U\|^2 K^{2/3} n^{1/3} + \sqrt{2}\, \|U\| K^{1/6} n^{1/3}\right]$$

where the infimum is over all $K \times d$ matrices $U$ and $\overline{L}_n(U) = \frac{1}{n} L_n(U)$
is the average hinge loss of $U$.

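Proof. We need to bound $M_n = \sum_{t=1}^n \mathbb{1}_{Y_t \neq y_t}$. Let $\mathbb{E}_t$ be the expectation
conditioned on the first $t - 1$ predictions. We start by bounding the
first and second moments of $\tilde{X}_t$,

$$\mathbb{E}_t\big[(\tilde{X}_t)_{i,j}\big] = \sum_{k=1}^K p_{k,t}\, x_{t,j} \left( \frac{\mathbb{1}_{k = y_t}\,\mathbb{1}_{k = i}}{p_{i,t}} - \mathbb{1}_{\hat{y}_t = i} \right) = x_{t,j}\big(\mathbb{1}_{y_t = i} - \mathbb{1}_{\hat{y}_t = i}\big) = (X_t)_{i,j}.$$

For the second moment, note that, since $\|x_t\| = 1$,

$$\|\tilde{X}_t\|^2 = \sum_{i=1}^K \sum_{j=1}^d x_{t,j}^2 \left( \frac{\mathbb{1}_{Y_t = y_t}\,\mathbb{1}_{Y_t = i}}{p_{i,t}} - \mathbb{1}_{\hat{y}_t = i} \right)^2 = \sum_{i=1}^K \left( \frac{\mathbb{1}_{Y_t = y_t}\,\mathbb{1}_{Y_t = i}}{p_{i,t}} - \mathbb{1}_{\hat{y}_t = i} \right)^2$$

where

$$\sum_{i=1}^K \left( \frac{\mathbb{1}_{Y_t = y_t}\,\mathbb{1}_{Y_t = i}}{p_{i,t}} - \mathbb{1}_{\hat{y}_t = i} \right)^2 = \begin{cases} \dfrac{1}{p_{y_t,t}^2} + 1 & \text{if } Y_t = y_t \neq \hat{y}_t, \\[1ex] \left( \dfrac{1}{p_{y_t,t}} - 1 \right)^2 & \text{if } Y_t = y_t = \hat{y}_t, \\[1ex] 1 & \text{otherwise.} \end{cases}$$

Therefore, if $y_t \neq \hat{y}_t$,

$$\mathbb{E}_t \|\tilde{X}_t\|^2 = p_{y_t,t} \left( \frac{1}{p_{y_t,t}^2} + 1 \right) + (1 - p_{y_t,t}) = 1 + \frac{1}{p_{y_t,t}} = 1 + \frac{K}{\gamma} \le \frac{2K}{\gamma}$$

because $p_{y_t,t} = \gamma/K$ when $y_t \neq \hat{y}_t$. Otherwise, if $y_t = \hat{y}_t$,

$$\mathbb{E}_t \|\tilde{X}_t\|^2 = p_{y_t,t} \left( \frac{1}{p_{y_t,t}} - 1 \right)^2 + (1 - p_{y_t,t}) = \frac{1}{p_{y_t,t}} - 1 \le \frac{\gamma}{1 - \gamma} \le 2\gamma$$

because $p_{y_t,t} \ge 1 - \gamma$ when $y_t = \hat{y}_t$ and $\gamma \le \frac{1}{2}$. Hence,

$$\mathbb{E}_t \|\tilde{X}_t\|^2 \le 2 \left( \frac{K}{\gamma}\, \mathbb{1}_{y_t \neq \hat{y}_t} + \gamma\, \mathbb{1}_{y_t = \hat{y}_t} \right).$$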
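The two conditional second moments above can be checked exactly by enumerating the drawn label $Y_t$. The following sketch assumes a unit-norm instance (so the sum over $j$ drops out) and 0-indexed labels; the helper name `second_moment` and the numerical values are illustrative only.

```python
def second_moment(K, gamma, y, y_hat):
    """Exact E_t ||X~_t||^2 for a unit-norm x_t, enumerating the drawn label Y."""
    p = [(1 - gamma) * (i == y_hat) + gamma / K for i in range(K)]
    total = 0.0
    for Y in range(K):
        # Squared norm of X~_t when Y is drawn (the sum over rows i of the matrix).
        sq = sum(((Y == y) * (Y == i) / p[i] - (y_hat == i)) ** 2 for i in range(K))
        total += p[Y] * sq
    return total

K, gamma = 5, 0.2
wrong = second_moment(K, gamma, y=0, y_hat=1)   # case y_t != y^_t
right = second_moment(K, gamma, y=2, y_hat=2)   # case y_t == y^_t
```

Under these assumptions `wrong` equals $1 + K/\gamma$ and `right` equals $1/p_{y_t,t} - 1$ with $p_{y_t,t} = 1 - \gamma + \gamma/K$, in agreement with the two displays.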



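We are now ready to prove a bound on the expected number of mistakes.
Following the standard analysis for the Perceptron algorithm,
we derive upper and lower bounds on the expectation of the quantity
$\langle U, W_{n+1} \rangle = \operatorname{tr}\big(U W_{n+1}^\top\big)$, for an arbitrary $K \times d$ matrix $U$. First, using
the Cauchy-Schwarz and Jensen inequalities we obtain

$$\mathbb{E} \langle U, W_{n+1} \rangle \le \sqrt{\|U\|^2\, \mathbb{E} \|W_{n+1}\|^2}.$$

Now

$$\mathbb{E}_n\big[\|W_{n+1}\|^2\big] = \mathbb{E}_n\big[\|W_n\|^2 + 2\langle W_n, \tilde{X}_n \rangle + \|\tilde{X}_n\|^2\big] \le \|W_n\|^2 + \mathbb{E}_n \|\tilde{X}_n\|^2.$$

In order to see why the inequality holds, note that

$$\mathbb{E}_n \langle W_n, \tilde{X}_n \rangle = \langle W_n, X_n \rangle = \sum_{i=1}^K (W_n x_n)_i \big(\mathbb{1}_{y_n = i} - \mathbb{1}_{\hat{y}_n = i}\big) = (W_n x_n)_{y_n} - (W_n x_n)_{\hat{y}_n} \le 0$$

because $\hat{y}_n = \operatorname{argmax}_{i=1,\ldots,K}(W_n x_n)_i$ by definition. Therefore, since
$W_1$ is the zero matrix,

$$\mathbb{E} \|W_{n+1}\|^2 \le \sum_{t=1}^n \mathbb{E} \|\tilde{X}_t\|^2 \le 2 \sum_{t=1}^n \left( \frac{K}{\gamma}\, \mathbb{P}(y_t \neq \hat{y}_t) + \gamma\, \mathbb{P}(y_t = \hat{y}_t) \right) \le \frac{2K}{\gamma} \sum_{t=1}^n \mathbb{P}(y_t \neq \hat{y}_t) + 2\gamma n.$$

Thus we have

$$\mathbb{E} \langle U, W_{n+1} \rangle \le \|U\| \sqrt{\frac{2K}{\gamma} \sum_{t=1}^n \mathbb{P}(y_t \neq \hat{y}_t) + 2\gamma n}.$$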



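Now we lower bound $\langle U, W_{n+1} \rangle$ as follows,

$$\mathbb{E}_n \langle U, W_{n+1} \rangle = \mathbb{E}_n \langle U, W_n + \tilde{X}_n \rangle = \langle U, W_n \rangle + \langle U, X_n \rangle \ge \langle U, W_n \rangle + \mathbb{1}_{y_n \neq \hat{y}_n} - \ell_n(U)$$

because, by definition of $\ell_t$,

$$\begin{aligned} \ell_t(U) &= \Big[ 1 - (U x_t)_{y_t} + \max_{i \neq y_t} (U x_t)_i \Big]_+ \\ &\ge 1 - (U x_t)_{y_t} + (U x_t)_{\hat{y}_t} \\ &\ge \mathbb{1}_{y_t \neq \hat{y}_t} - (U x_t)_{y_t} + (U x_t)_{\hat{y}_t} \\ &= \mathbb{1}_{y_t \neq \hat{y}_t} - \langle U, X_t \rangle. \end{aligned}$$

(The second line is for $\hat{y}_t \neq y_t$; when $\hat{y}_t = y_t$ the final bound reads
$\ell_t(U) \ge 0$, which holds because the hinge loss is non-negative.)
Therefore, using again the fact that $W_1$ is the zero matrix,

$$\mathbb{E} \langle U, W_{n+1} \rangle \ge \sum_{t=1}^n \mathbb{P}(y_t \neq \hat{y}_t) - \sum_{t=1}^n \ell_t(U).$$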
t=i t=i 

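Combining the upper and lower bounds on $\langle U, W_{n+1} \rangle$ we get

$$\sum_{t=1}^n \mathbb{P}(\hat{y}_t \neq y_t) - L_n(U) \le \|U\| \sqrt{\frac{2K}{\gamma} \sum_{t=1}^n \mathbb{P}(\hat{y}_t \neq y_t) + 2\gamma n}.$$

Solving for $\sum_t \mathbb{P}(\hat{y}_t \neq y_t)$ and overapproximating yields

$$\begin{aligned} \sum_{t=1}^n \mathbb{P}(\hat{y}_t \neq y_t) &\le L_n(U) + \frac{2K}{\gamma} \|U\|^2 + \|U\| \sqrt{\frac{2K}{\gamma} L_n(U) + 2\gamma n} \\ &= L_n(U) + \frac{2K}{\gamma} \|U\|^2 + \|U\| \sqrt{\left( \frac{2K}{\gamma} \bar{L}_n(U) + 2\gamma \right) n}. \end{aligned}$$

Now, since $\mathbb{P}(y_t \neq Y_t) \le (1 - \gamma)\, \mathbb{P}(y_t \neq \hat{y}_t) + \gamma$,

$$\sum_{t=1}^n \mathbb{P}(y_t \neq Y_t) \le L_n(U) + \gamma n + \frac{2K}{\gamma} \|U\|^2 + \|U\| \sqrt{\left( \frac{2K}{\gamma} \bar{L}_n(U) + 2\gamma \right) n}.$$

Choosing $\gamma$ as in the statement of the theorem yields the desired result.
Note that $\gamma \le \frac{1}{2}$ because we assume $n \ge 8K$. □

Linear bandits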



We now consider the important generalization of adversarial bandits
where the set of arms $\{1, \ldots, K\}$ is replaced by a compact set $\mathcal{K} \subset \mathbb{R}^d$.
In this case, the loss at each step is some function defined on $\mathcal{K}$, and the
task is to pick an arm as close as possible to the minimum of the loss
function at hand. In order to allow sublinear regret bounds, even in the
presence of infinitely many arms, we must assume some structure for
the loss function. In particular, in this chapter we assume that the loss
at each time step is a linear function of arms. Linearity is a standard
assumption (think, for instance, of linear regression) and naturally occurs
in many bandit applications. The source routing problem mentioned in
the introduction is a good example, since the cost of choosing a routing
path is linear in the cost of the edges that make up the path. This
defines the so-called online linear optimization setting: at each time
step $t = 1, 2, \ldots$ the forecaster chooses $x_t \in \mathcal{K}$ while, simultaneously,
the adversary chooses $\ell_t$ from some fixed and known subset $\mathcal{L}$ of $\mathbb{R}^d$.
The loss incurred by the forecaster is the inner product $x_t^\top \ell_t$. In this
chapter we focus on the analysis of the pseudo-regret

$$\bar{R}_n = \mathbb{E} \sum_{t=1}^n x_t^\top \ell_t - \min_{x \in \mathcal{K}} \mathbb{E} \sum_{t=1}^n x^\top \ell_t$$

where the expectation is with respect to the forecaster's internal
randomization. The adversarial bandit setting of Chapter 3 is obtained
by choosing $\mathcal{K} = \{e_1, \ldots, e_d\}$, where $e_1, \ldots, e_d$ is the canonical basis
of $\mathbb{R}^d$, and $\mathcal{L} = [0, 1]^d$. Similarly to Chapter 3, we focus on the bandit
feedback where the forecaster only observes the incurred loss $x_t^\top \ell_t$ at
the end of round $t$. However, we also discuss the full information setting,
where the complete loss vector $\ell_t$ is revealed at the end of each round
$t$, as well as other feedback models.

It is important to note that, without any loss of generality (as far
as regret bounds are concerned), one can always assume that $\mathcal{K}$ has
size $O(n^d)$. Indeed, since $\mathcal{K}$ is a compact set and the loss is linear
(and therefore Lipschitz), one can cover $\mathcal{K}$ with $O(n^d)$ points such
that one incurs a vanishing extra cumulative regret by playing on the
discretization rather than on the original set. Of course, the downside of
this approach is that a strategy resulting from such a cover is often not
computationally efficient. On the other hand, this assumption allows
us to apply ideas from Chapter 3 to this more general setting. We
analyze the pseudo-regret for finite classes in Section 5.1. Without loss
of generality, it is also possible to assume that $\mathcal{K}$ is a convex set. Indeed,
the pseudo-regret is the same if one plays $x_t$, or if one plays a point
at random in $\mathcal{K}$ such that the expectation of the played point is $x_t$.
This remark is critical, and allows us to develop a powerful technology
known as the Mirror Descent algorithm. We describe this approach in
Section 5.2 and explore it further in subsequent sections.

5.1 Exp2 (Expanded Exp) with John's exploration 

In this section we work under the bounded scalar loss assumption. That
is, we assume that $\mathcal{K}$ and $\mathcal{L}$ are such that $|x^\top \ell| \le 1$ for any
$(x, \ell) \in \mathcal{K} \times \mathcal{L}$. Moreover, we restrict our attention to finite sets $\mathcal{K}$, with $|\mathcal{K}| = N$.
Without loss of generality we assume that $\mathcal{K}$ spans $\mathbb{R}^d$. If this is not the
case, then one can rewrite the elements of $\mathcal{K}$ in some lower dimensional
vector space, and work there. Note that a trivial application of Exp3
to the set $\mathcal{K}$ of arms gives a bound that scales as $\sqrt{nN \ln N}$. If $\mathcal{K}$ is a
discretized set (in the sense described earlier), then $N$ is exponential
in $d$. We show here that, by appropriately modifying Exp3, one can
obtain a polynomial regret of order $\sqrt{nd \ln N}$.

To describe the strategy, we first need a useful result from convex
geometry: John's theorem. This result concerns the ellipsoid $\mathcal{E}$ of minimal
volume enclosing a given convex set $\mathcal{K}$ (which we call the John's
ellipsoid of $\mathcal{K}$). Basically, the theorem states that $\mathcal{E}$ has many contact
points with $\mathcal{K}$, and these contact points are "nicely" distributed, in the
sense that they almost form an orthonormal basis; see [Ball, 1997] for
a proof.



Theorem 5.1 (John's theorem). Let $\mathcal{K} \subset \mathbb{R}^d$ be a convex set. If
the ellipsoid $\mathcal{E}$ of minimal volume enclosing $\mathcal{K}$ is the unit ball in some
norm derived from a scalar product $\langle \cdot, \cdot \rangle$, then there exist
$M \le \frac{1}{2} d(d+1) + 1$ contact points $u_1, \ldots, u_M$ between $\mathcal{E}$ and $\mathcal{K}$, and a probability
distribution $(\mu_1, \ldots, \mu_M)$ over these contact points, such that

$$x = d \sum_{i=1}^M \mu_i \langle x, u_i \rangle u_i \qquad \forall x \in \mathbb{R}^d.$$



In fact, John's theorem is an if and only if statement, but here we only need the
direction stated in the theorem. We are now in a position to describe the
strategy. Let $\mathrm{Conv}(S)$ be the convex hull of a set $S \subset \mathbb{R}^d$. First, we
perform a preprocessing step in which the set $\mathcal{K}$ is rewritten so that
John's ellipsoid of $\mathrm{Conv}(\mathcal{K})$ is the unit ball for some inner product
$\langle \cdot, \cdot \rangle$. The loss of playing $x \in \mathcal{K}$ against $\ell \in \mathcal{L}$ is then given by $\langle x, \ell \rangle$.
See Bubeck et al. [2012a] for the details of this transformation. Let
$u_1, \ldots, u_M \in \mathcal{K}$ and $(\mu_1, \ldots, \mu_M)$ satisfy Theorem 5.1 for the convex
set $\mathrm{Conv}(\mathcal{K})$.

Recall from Chapter 3 that the key idea to play against an adversary
is to select $x_t$ at random from some probability distribution $p_t$ over $\mathcal{K}$.
We first describe how to build an unbiased estimate of $\ell_t$, given such a
point $x_t$ played at random from $p_t$ (such that $p_t(x) > 0$ for any $x \in \mathcal{K}$).
Recall that the outer product $x \otimes x$ is defined as the linear mapping
from $\mathbb{R}^d$ to $\mathbb{R}^d$ such that $x \otimes x\,(y) = \langle x, y \rangle x$. Note that one can also
view $x \otimes x$ as a $d \times d$ matrix, so that the evaluation of $x \otimes x$ is equivalent
to a multiplication by the corresponding matrix. Now let

$$P_t = \sum_{x \in \mathcal{K}} p_t(x)\, x \otimes x.$$

Note that this matrix is invertible, since $\mathcal{K}$ is full rank and $p_t(x) > 0$
for all $x \in \mathcal{K}$. The estimate for $\ell_t$ is given by $\tilde{\ell}_t = P_t^{-1} (x_t \otimes x_t)\, \ell_t$. Note
that this is a valid estimate since $(x_t \otimes x_t)\, \ell_t = \langle x_t, \ell_t \rangle x_t$ and $P_t^{-1}$ are
observed quantities. Also, it is an unbiased estimate, since
$\mathbb{E}_{x_t \sim p_t}[(x_t \otimes x_t)\, \ell_t] = P_t \ell_t$.
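The unbiasedness of this estimate can be checked exactly on a small example by averaging over the arms. The sketch below works in $\mathbb{R}^2$ with the standard Euclidean inner product (an assumption; the text uses the inner product obtained from the John preprocessing), and the arm set, distribution, and loss vector are made-up values.

```python
def outer(x):
    """x (x) x as a 2 x 2 matrix."""
    return [[x[0] * x[0], x[0] * x[1]], [x[1] * x[0], x[1] * x[1]]]

def inv2(A):
    """Inverse of a 2 x 2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

def matvec(A, v):
    return [A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1]]

arms = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]   # finite arm set spanning R^2
p = [0.5, 0.3, 0.2]                           # p_t(x) > 0 for every arm
loss = [0.3, -0.7]                            # hidden loss vector l_t

# P_t = sum_x p_t(x) x (x) x
P = [[sum(pk * outer(x)[r][c] for pk, x in zip(p, arms)) for c in range(2)]
     for r in range(2)]
P_inv = inv2(P)

def estimate(x):
    """l~_t = P_t^{-1} (x (x) x) l_t, built from the observed scalar <x, l_t> only."""
    observed = x[0] * loss[0] + x[1] * loss[1]
    return [observed * c for c in matvec(P_inv, x)]

# Exact expectation of the estimate when x_t is drawn from p_t.
mean = [sum(pk * estimate(x)[i] for pk, x in zip(p, arms)) for i in range(2)]
```

Here `mean` coincides with `loss` up to rounding, although each individual estimate only uses the scalar feedback $\langle x_t, \ell_t \rangle$.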

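Now the Exp2 algorithm with John's exploration corresponds to
playing according to the following probability distribution

$$p_t(x) = (1 - \gamma)\, \frac{\exp\left( -\eta \sum_{s=1}^{t-1} \langle x, \tilde{\ell}_s \rangle \right)}{\sum_{y \in \mathcal{K}} \exp\left( -\eta \sum_{s=1}^{t-1} \langle y, \tilde{\ell}_s \rangle \right)} + \gamma \sum_{i=1}^M \mu_i\, \mathbb{1}_{x = u_i} \qquad (5.1)$$

where $\eta, \gamma > 0$ are input parameters. Note that this corresponds to
a variant of Exp3 using $\langle x, \tilde{\ell}_t \rangle$ as loss estimate for $x \in \mathcal{K}$, and an
exploration distribution supported by the contact points.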



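Theorem 5.2 (Pseudo-regret of Exp2 with John's exploration).
For any $\eta, \gamma > 0$ such that $\eta d \le \gamma$, Exp2 with John's exploration
satisfies

$$\bar{R}_n \le 2\gamma n + \frac{\ln N}{\eta} + \eta n d.$$

In particular, with $\eta = \sqrt{\frac{\ln N}{3nd}}$ and $\gamma = \eta d$,

$$\bar{R}_n \le 2\sqrt{3nd \ln N}.$$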
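As a quick arithmetic check of the tuning, one can plug the stated $\eta$ and $\gamma = \eta d$ into the first bound and compare with the closed form. The function name `exp2_bound` and the values of $n$, $d$, $N$ below are arbitrary choices for illustration.

```python
import math

def exp2_bound(n, d, N, eta, gamma):
    """Right-hand side of Theorem 5.2: 2*gamma*n + ln(N)/eta + eta*n*d."""
    assert eta * d <= gamma + 1e-15, "the theorem requires eta * d <= gamma"
    return 2 * gamma * n + math.log(N) / eta + eta * n * d

n, d, N = 10_000, 5, 1_000
eta = math.sqrt(math.log(N) / (3 * n * d))
gamma = eta * d
tuned = exp2_bound(n, d, N, eta, gamma)
closed_form = 2 * math.sqrt(3 * n * d * math.log(N))
```

With $\gamma = \eta d$ the bound reads $3 \eta n d + \ln N / \eta$, which is minimized at the stated $\eta$; any other learning rate (for instance `2 * eta`) gives a larger value.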



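Proof. Since $\mathcal{K}$ is finite, we can easily adapt the analysis of Exp3 in
Theorem 3.1 to take into account the exploration term. This gives

$$\bar{R}_n \le 2\gamma n + \frac{\ln N}{\eta} + \eta\, \mathbb{E} \sum_{t=1}^n \sum_{x \in \mathcal{K}} p_t(x) \langle x, \tilde{\ell}_t \rangle^2$$

whenever $\eta \langle x, \tilde{\ell}_t \rangle \le 1$ for all $x \in \mathcal{K}$. We now bound the last term in
the right-hand side of the above inequality. Using the definition of the
estimated loss $\tilde{\ell}_t = P_t^{-1} (x_t \otimes x_t)\, \ell_t$, we can write

$$\begin{aligned} \sum_{x \in \mathcal{K}} p_t(x) \langle x, \tilde{\ell}_t \rangle^2 &= \sum_{x \in \mathcal{K}} p_t(x) \langle \tilde{\ell}_t, (x \otimes x)\, \tilde{\ell}_t \rangle \\ &= \langle \tilde{\ell}_t, P_t \tilde{\ell}_t \rangle \\ &= \langle x_t, \ell_t \rangle^2 \langle P_t^{-1} x_t, P_t P_t^{-1} x_t \rangle \\ &\le \langle P_t^{-1} x_t, x_t \rangle. \end{aligned}$$

Now we use a spectral decomposition of $P_t$ in an orthonormal basis
for $\langle \cdot, \cdot \rangle$ and write $P_t = \sum_{i=1}^d \lambda_i v_i \otimes v_i$. In particular, we have
$P_t^{-1} = \sum_{i=1}^d \frac{1}{\lambda_i} v_i \otimes v_i$ and thus

$$\begin{aligned} \mathbb{E} \langle P_t^{-1} x_t, x_t \rangle &= \sum_{i=1}^d \frac{1}{\lambda_i}\, \mathbb{E} \langle (v_i \otimes v_i)\, x_t, x_t \rangle \\ &= \sum_{i=1}^d \frac{1}{\lambda_i}\, \mathbb{E} \langle (x_t \otimes x_t)\, v_i, v_i \rangle \\ &= \sum_{i=1}^d \frac{1}{\lambda_i} \langle P_t v_i, v_i \rangle \\ &= \sum_{i=1}^d \frac{1}{\lambda_i} \langle \lambda_i v_i, v_i \rangle = d. \end{aligned}$$

Finally, to show $\eta \langle x, \tilde{\ell}_t \rangle \le 1$ observe that

$$\langle x, \tilde{\ell}_t \rangle = \langle x_t, \ell_t \rangle \langle x, P_t^{-1} x_t \rangle \le \langle x, P_t^{-1} x_t \rangle \le \frac{1}{\min_{1 \le i \le d} \lambda_i},$$

where the last inequality follows from the fact that $\langle x, x \rangle \le 1$ for any
$x \in \mathcal{K}$, since $\mathcal{K}$ is included in the unit ball. To conclude the proof, we
need to lower bound the smallest eigenvalue of $P_t$. Using Theorem 5.1,
one can see that $P_t \succeq \frac{\gamma}{d} I_d$, and thus $\lambda_i \ge \frac{\gamma}{d}$. Since $\eta d \le \gamma$, the proof is
concluded. □

5.2 Online Mirror Descent (OMD)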

We now introduce the Online Mirror Descent (OMD) algorithm, a powerful
generalization of gradient descent for sequential decision problems.
We start by describing OMD for convex losses in the full information
setting. That is, $\mathcal{L}$ is a set of convex functions, and at the end of round
$t$ the forecaster observes $\ell_t \in \mathcal{L}$ rather than only $\ell_t(x_t)$.

The rest of this chapter is organized as follows. Next, we briefly
recall a few key concepts from convex analysis. Then we describe the
OMD strategy and prove a general regret bound. In Section 5.3 we
introduce Online Stochastic Mirror Descent (OSMD), which is a variant
of OMD based on a stochastic estimate of the gradient. We apply this
strategy to linear losses in two different bandit settings. Finally, in
Section 5.5 we show how OSMD obtains improved bounds in certain
special cases. The case of convex losses with bandit feedback is treated
in Chapter 6.

We introduce the following definitions. 

Definition 5.1. Let $\mathcal{X} \subseteq \mathbb{R}^d$. A function $f : \mathcal{X} \to \mathbb{R}$ is subdifferentiable
if for all $x \in \mathcal{X}$ there exists $g \in \mathbb{R}^d$ such that

$$f(x) - f(y) \le g^\top (x - y) \qquad \forall y \in \mathcal{X}.$$

Such a $g$ is called a subgradient of $f$ at $x$.

Abusing notation, we use $\nabla f(x)$ to denote both the gradient of $f$ at
$x$ when $f$ is differentiable, and any subgradient of $f$ at $x$ when $f$ is
subdifferentiable (a sufficient condition for subdifferentiability of $f$ is
that $f$ is convex and $\mathcal{X}$ is open).

Definition 5.2. Let $f : \mathcal{X} \to \mathbb{R}$ be a convex function defined on a
convex set $\mathcal{X} \subseteq \mathbb{R}^d$. The Legendre-Fenchel transform of $f$ is defined by:

$$f^*(u) = \sup_{x \in \mathcal{X}} \big( x^\top u - f(x) \big).$$

This definition directly implies the Fenchel-Young inequality for convex
functions, $u^\top x \le f(x) + f^*(u)$.

Let $\mathcal{D} \subset \mathbb{R}^d$ be an open convex set, and let $\bar{\mathcal{D}}$ be the closure of $\mathcal{D}$.

Definition 5.3. A continuous function $F : \bar{\mathcal{D}} \to \mathbb{R}$ is Legendre if

(i) $F$ is strictly convex and admits continuous first partial
derivatives on $\mathcal{D}$;

(ii) $\lim_{x \to \bar{\mathcal{D}} \setminus \mathcal{D}} \|\nabla F(x)\| = +\infty$ (see footnote 1).

The Bregman divergence $D_F : \bar{\mathcal{D}} \times \mathcal{D} \to \mathbb{R}$ associated with a Legendre
function $F$ is defined by $D_F(x, y) = F(x) - F(y) - (x - y)^\top \nabla F(y)$.
Moreover, we say that $\mathcal{D}^* = \nabla F(\mathcal{D})$ is the dual space of $\mathcal{D}$ under $F$.
Note that, by definition, $D_F(x, y) > 0$ if $x \neq y$, and $D_F(x, x) = 0$. The
following lemma is the key to understanding how a Legendre function acts
on the space. See [Cesa-Bianchi and Lugosi, 2006, Proposition 11.1] for
a proof.
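The definition of $D_F$ translates directly into code. The sketch below evaluates it for two standard choices of $F$: the squared Euclidean norm, where it gives half the squared distance, and the unnormalized negative entropy, where it gives the generalized Kullback-Leibler divergence. The helper names and the test points are illustrative assumptions.

```python
import math

def bregman(F, grad_F, x, y):
    """D_F(x, y) = F(x) - F(y) - (x - y)^T grad F(y)."""
    g = grad_F(y)
    return F(x) - F(y) - sum((xi - yi) * gi for xi, yi, gi in zip(x, y, g))

def sq(x):                      # F(x) = ||x||^2 / 2
    return 0.5 * sum(v * v for v in x)

def sq_grad(x):
    return list(x)

def negent(x):                  # F(x) = sum x_i ln x_i - sum x_i
    return sum(v * math.log(v) - v for v in x)

def negent_grad(x):
    return [math.log(v) for v in x]

x, y = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]
d_sq = bregman(sq, sq_grad, x, y)                 # 0.5 * ||x - y||^2
d_kl = bregman(negent, negent_grad, x, y)         # generalized KL divergence
kl = sum(a * math.log(a / b) - a + b for a, b in zip(x, y))
```

Both divergences are positive for $x \neq y$ and vanish at $x = y$, as noted above.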

Lemma 5.1. Let $F$ be a Legendre function. Then $F^{**} = F$, and
$\nabla F^* = (\nabla F)^{-1}$ restricted on the set $\mathcal{D}^*$. Moreover, for all $x, y \in \mathcal{D}$,

$$D_F(x, y) = D_{F^*}\big(\nabla F(y), \nabla F(x)\big). \qquad (5.2)$$

The gradient $\nabla F$ maps $\mathcal{D}$ to the dual space $\mathcal{D}^*$, and $\nabla F^*$ is the inverse
mapping from the dual space to the original (primal) space. Note also
that (5.2) shows that the Bregman divergence in the primal exactly
corresponds to the Bregman divergence of the Legendre transform in
the dual.

The next lemma shows that the geometry induced by a Bregman
divergence resembles the geometry of the squared Euclidean distance.
See [Cesa-Bianchi and Lugosi, 2006, Lemma 11.3] for a proof.



Lemma 5.2 (Generalized Pythagorean inequality). Let $\mathcal{K} \subseteq \bar{\mathcal{D}}$
be a closed convex set such that $\mathcal{K} \cap \mathcal{D} \neq \emptyset$. Then, for all $x \in \mathcal{D}$, the
Bregman projection

$$z = \operatorname*{argmin}_{y \in \mathcal{K}} D_F(y, x)$$

exists and is unique. Moreover, $z \in \mathcal{K} \cap \mathcal{D}$ and, for all $y \in \mathcal{K}$,

$$D_F(y, x) \ge D_F(y, z) + D_F(z, x).$$

Footnote 1 (Definition 5.3): By the equivalence of norms in $\mathbb{R}^d$, this
definition does not depend on the choice of the norm.



The idea of OMD is very simple: first, select a Legendre function $F$ on
$\bar{\mathcal{D}} \supset \mathcal{K}$ (such that $\mathcal{K} \cap \mathcal{D} \neq \emptyset$); second, perform a gradient descent step,
where the update with the gradient is performed in the dual space $\mathcal{D}^*$
rather than in the primal $\mathcal{D}$; third, project back to $\mathcal{K}$ according to the
Bregman divergence defined by $F$.

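OMD (Online Mirror Descent):

Parameters: compact and convex set $\mathcal{K} \subseteq \mathbb{R}^d$, learning rate $\eta > 0$,
Legendre function $F$ on $\bar{\mathcal{D}} \supset \mathcal{K}$.

Initialize: $x_1 \in \operatorname{argmin}_{x \in \mathcal{K}} F(x)$ (note that $x_1 \in \mathcal{K} \cap \mathcal{D}$).

For each round $t = 1, 2, \ldots, n$

(1) Play $x_t$ and observe loss vector $\ell_t$.

(2) $w_{t+1} = \nabla F^*\big(\nabla F(x_t) - \eta \nabla \ell_t(x_t)\big)$.

(3) $x_{t+1} = \operatorname{argmin}_{y \in \mathcal{K}} D_F(y, w_{t+1})$.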

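Note that step (2) is well defined if the following consistency
condition is satisfied:

$$\nabla F(x) - \eta \nabla \ell(x) \in \mathcal{D}^* \qquad \forall (x, \ell) \in (\mathcal{K} \cap \mathcal{D}) \times \mathcal{L}. \qquad (5.3)$$

Note also that step (2) can be rewritten as

$$\nabla F(w_{t+1}) = \nabla F(x_t) - \eta \nabla \ell_t(x_t). \qquad (5.4)$$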

Finally, note that the Bregman projection in step (3) is a convex
program, in the sense that $x \mapsto D_F(x, y)$ is always a convex function. This
does not necessarily imply that step (3) can be performed efficiently,
since in some cases the feasible set $\mathcal{K}$ might only be described with an
exponential number of constraints (we encounter examples like this in
Section 5.4).

In the description above we emphasized that $F$ has to be a Legendre
function. In fact, as we see in Theorem 5.4, if $F$ has effective domain
$\mathcal{K}$ (that is, $F$ takes value $+\infty$ outside of $\mathcal{K}$), then it suffices that the
Legendre-Fenchel dual $F^*$ is differentiable on $\mathbb{R}^d$ to obtain a good regret
bound. See the end of this section for more details on this.

When $\mathcal{K}$ is the simplex and $F(x) = \sum_{i=1}^d x_i \ln x_i - \sum_{i=1}^d x_i$, OMD
reduces to an exponentially weighted average forecaster, similar to those
studied in Chapter 3. The well-known online gradient descent strategy
corresponds to taking $F(x) = \frac{1}{2} \|x\|_2^2$. In the following we shall see
other possibilities for the Legendre function $F$.
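The reduction to exponential weights can be illustrated numerically. In the sketch below, steps (2) and (3) of OMD are carried out for the negative entropy on the simplex with linear losses, using the fact that the Bregman projection onto the simplex is then a normalization, and the result is compared with the exponentially weighted forecaster on the cumulative losses. Function names and the loss sequence are assumptions of the example.

```python
import math

def omd_entropy_step(x, grad, eta):
    """One OMD round on the simplex with F(x) = sum x_i ln x_i - sum x_i."""
    # Step (2): grad F(w) = grad F(x) - eta * grad, where grad F(x)_i = ln x_i.
    w = [math.exp(math.log(xi) - eta * gi) for xi, gi in zip(x, grad)]
    # Step (3): for this F, the Bregman projection onto the simplex normalizes w.
    total = sum(w)
    return [wi / total for wi in w]

def exp_weights(cum_loss, eta):
    """Exponentially weighted average forecaster on cumulative losses."""
    w = [math.exp(-eta * L) for L in cum_loss]
    total = sum(w)
    return [wi / total for wi in w]

losses = [[0.1, 0.9, 0.5], [0.7, 0.2, 0.4], [0.3, 0.3, 0.8]]
eta = 0.5
x = [1 / 3] * 3                  # uniform point: minimizer of F over the simplex
cum = [0.0] * 3
for loss in losses:
    x = omd_entropy_step(x, loss, eta)
    cum = [c + l for c, l in zip(cum, loss)]
```

After the loop, `x` agrees with `exp_weights(cum, eta)` up to rounding, since the normalizations telescope.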

We prove now a very general and powerful theorem concerning the 
regret of OMD. 



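Theorem 5.3 (Regret of OMD with a Legendre function).
Let $\mathcal{K}$ be a compact and convex set of arms, $\mathcal{L}$ be a set of subdifferentiable
functions, and $F$ a Legendre function defined on $\bar{\mathcal{D}} \supset \mathcal{K}$, such
that (5.3) is satisfied. Then OMD satisfies for any $x \in \mathcal{K}$,

$$\sum_{t=1}^n \ell_t(x_t) - \sum_{t=1}^n \ell_t(x) \le \frac{F(x) - F(x_1)}{\eta} + \frac{1}{\eta} \sum_{t=1}^n D_{F^*}\big(\nabla F(x_t) - \eta \nabla \ell_t(x_t), \nabla F(x_t)\big).$$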



Proof. Let $x \in \mathcal{K}$. Since $\mathcal{L}$ is a set of subdifferentiable functions, we
have:

$$\sum_{t=1}^n \big(\ell_t(x_t) - \ell_t(x)\big) \le \sum_{t=1}^n \nabla \ell_t(x_t)^\top (x_t - x).$$

Using (5.4), and applying the definition of $D_F$, one obtains

$$\begin{aligned} \eta \nabla \ell_t(x_t)^\top (x_t - x) &= (x - x_t)^\top \big(\nabla F(w_{t+1}) - \nabla F(x_t)\big) \\ &= D_F(x, x_t) + D_F(x_t, w_{t+1}) - D_F(x, w_{t+1}). \end{aligned}$$

By Lemma 5.2, one gets $D_F(x, w_{t+1}) \ge D_F(x, x_{t+1}) + D_F(x_{t+1}, w_{t+1})$,
hence

$$\eta \nabla \ell_t(x_t)^\top (x_t - x) \le D_F(x, x_t) + D_F(x_t, w_{t+1}) - D_F(x, x_{t+1}) - D_F(x_{t+1}, w_{t+1}).$$

Summing over $t$ then gives

$$\sum_{t=1}^n \eta \nabla \ell_t(x_t)^\top (x_t - x) \le D_F(x, x_1) - D_F(x, x_{n+1}) + \sum_{t=1}^n \big( D_F(x_t, w_{t+1}) - D_F(x_{t+1}, w_{t+1}) \big).$$

By the nonnegativity of the Bregman divergences, we get

$$\sum_{t=1}^n \eta \nabla \ell_t(x_t)^\top (x_t - x) \le D_F(x, x_1) + \sum_{t=1}^n D_F(x_t, w_{t+1}).$$

From (5.2), one has $D_F(x_t, w_{t+1}) = D_{F^*}\big(\nabla F(x_t) - \eta \nabla \ell_t(x_t), \nabla F(x_t)\big)$
and, moreover, by first-order optimality, one has $D_F(x, x_1) \le F(x) - F(x_1)$,
which concludes the proof. □
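A small numerical illustration of the bound of Theorem 5.3, under assumptions chosen for simplicity: one dimension, $\mathcal{K} = [-1, 1]$, $F(x) = x^2/2$ (so OMD is projected online gradient descent and $D_{F^*}(u, v) = (u - v)^2/2$), and linear losses $\ell_t(x) = g_t x$ with made-up coefficients $g_t$.

```python
import random

def ogd_regret_and_bound(grads, eta):
    """OMD with F(x) = x^2 / 2 on K = [-1, 1] against linear losses l_t(x) = g_t * x."""
    x, played_loss = 0.0, 0.0                  # x_1 = argmin F = 0
    for g in grads:
        played_loss += g * x
        w = x - eta * g                        # dual update: grad F is the identity
        x = max(-1.0, min(1.0, w))             # Bregman projection = clipping to K
    # For linear losses the best fixed point of [-1, 1] is an endpoint.
    best = min(sum(grads) * c for c in (-1.0, 1.0))
    regret = played_loss - best
    # Theorem 5.3 with F(x) - F(x_1) <= 1/2 and D_{F*}(u, v) = (u - v)^2 / 2.
    bound = 1.0 / (2 * eta) + 0.5 * eta * sum(g * g for g in grads)
    return regret, bound

rng = random.Random(1)
grads = [rng.uniform(-1, 1) for _ in range(500)]
regret, bound = ogd_regret_and_bound(grads, eta=0.05)
```

Since the theorem holds for every loss sequence, `regret` never exceeds `bound`, whatever seed is used.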

We show now how to prove a regret bound if $F$ has effective domain
$\mathcal{K}$ and $F^*$ is differentiable on $\mathbb{R}^d$, but $F$ is not necessarily Legendre. In this
case, it is easy to see that step (2) and step (3) in OMD reduce to

$$x_{t+1} = \nabla F^*\left( -\eta \sum_{s=1}^t \nabla \ell_s(x_s) \right).$$



Theorem 5.4 (Regret of OMD with non-Legendre function).
Let $\mathcal{K}$ be a compact set of actions, $\mathcal{L}$ be a set of subdifferentiable
functions, and $F$ a function with effective domain $\mathcal{K}$ such that $F^*$ is
differentiable on $\mathbb{R}^d$. Then for any $x \in \mathcal{K}$ OMD satisfies

$$\sum_{t=1}^n \ell_t(x_t) - \sum_{t=1}^n \ell_t(x) \le \frac{F(x) - F(x_1)}{\eta} + \frac{1}{\eta} \sum_{t=1}^n D_{F^*}\left( -\eta \sum_{s=1}^{t} \nabla \ell_s(x_s),\; -\eta \sum_{s=1}^{t-1} \nabla \ell_s(x_s) \right).$$



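Proof. Let $x \in \mathcal{K}$. Since $\mathcal{L}$ is a set of subdifferentiable functions, we
have

$$\sum_{t=1}^n \big(\ell_t(x_t) - \ell_t(x)\big) \le \sum_{t=1}^n \nabla \ell_t(x_t)^\top (x_t - x).$$

Using the Fenchel-Young inequality, one obtains

$$\begin{aligned} -\eta \sum_{t=1}^n \nabla \ell_t(x_t)^\top x &\le F(x) + F^*\left( -\eta \sum_{t=1}^n \nabla \ell_t(x_t) \right) \\ &= F(x) + F^*(0) + \sum_{t=1}^n \left( F^*\left( -\eta \sum_{s=1}^{t} \nabla \ell_s(x_s) \right) - F^*\left( -\eta \sum_{s=1}^{t-1} \nabla \ell_s(x_s) \right) \right). \end{aligned}$$

Observe that $F^*(0) = -F(x_1)$ and, for each term in the above sum,

$$\begin{aligned} F^*\left( -\eta \sum_{s=1}^{t} \nabla \ell_s(x_s) \right) - F^*\left( -\eta \sum_{s=1}^{t-1} \nabla \ell_s(x_s) \right) &= \nabla F^*\left( -\eta \sum_{s=1}^{t-1} \nabla \ell_s(x_s) \right)^{\!\top} \big( -\eta \nabla \ell_t(x_t) \big) + D_{F^*}\left( -\eta \sum_{s=1}^{t} \nabla \ell_s(x_s),\; -\eta \sum_{s=1}^{t-1} \nabla \ell_s(x_s) \right) \\ &= -\eta\, x_t^\top \nabla \ell_t(x_t) + D_{F^*}\left( -\eta \sum_{s=1}^{t} \nabla \ell_s(x_s),\; -\eta \sum_{s=1}^{t-1} \nabla \ell_s(x_s) \right). \end{aligned}$$

□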



5.3 Online Stochastic Mirror Descent (OSMD) 

We now turn to the analysis of the bandit setting, where the gradient
information $\nabla \ell_t(x_t)$ is not available, and thus one cannot run OMD.
This scenario has been extensively studied in gradient-free optimization, and
the basic idea is to play a perturbed version $\tilde{x}_t$ of the current point $x_t$.
This should be done in such a way that, upon observing $\ell_t(\tilde{x}_t)$, one can
build an unbiased estimate $\tilde{g}_t$ of $\nabla \ell_t(x_t)$. Whereas such estimators are
presented in Chapter 6, here we restrict our attention to linear losses.
For this simpler case we consider specialized estimators with optimal
performances. Given a perturbation scheme, one can run OMD with
the gradient estimates instead of the true gradients. We call the resulting
algorithm Online Stochastic Mirror Descent (OSMD).

OSMD (Online Stochastic Mirror Descent):

Parameters: compact and convex set $\mathcal{K} \subseteq \mathbb{R}^d$, learning rate $\eta > 0$,
Legendre function $F$ on $\bar{\mathcal{D}} \supset \mathcal{K}$.

Initialize: $x_1 \in \operatorname{argmin}_{x \in \mathcal{K}} F(x)$ (note that $x_1 \in \mathcal{K} \cap \mathcal{D}$).

For each round $t = 1, 2, \ldots, n$

(1) Play a random perturbation $\tilde{x}_t$ of $x_t$ and observe $\ell_t(\tilde{x}_t)$.

(2) Compute random estimate $\tilde{g}_t$ of $\nabla \ell_t(x_t)$.

(3) $w_{t+1} = \nabla F^*\big(\nabla F(x_t) - \eta \tilde{g}_t\big)$.

(4) $x_{t+1} = \operatorname{argmin}_{y \in \mathcal{K}} D_F(y, w_{t+1})$.



In order to relate this linear bandit strategy to the Exp2 forecaster
(5.1), it is important to observe that running the Exp2 forecaster
over a finite set $\mathcal{K}$ of arms, with exploration distribution $\mu$ and
mixing coefficient $\gamma > 0$, is equivalent to running OSMD over the
$|\mathcal{K}|$-dimensional simplex with $F(p) = \sum_{x \in \mathcal{K}} p(x) \ln p(x)$ (the negative entropy),
$\tilde{x}_t$ drawn from $(1 - \gamma) p_t + \gamma \mu$, where $p_t$ denotes the current OSMD point in
the simplex, and estimated linear loss $\tilde{g}_t = \big(\langle x, \tilde{\ell}_t \rangle\big)_{x \in \mathcal{K}}$.
Indeed, the projection step (4), when $F$ is the negative entropy,
corresponds to the standard normalization of a probability distribution.

The following theorem establishes a general regret bound for
OSMD. Note that here the pseudo-regret is defined as

$$\bar{R}_n = \mathbb{E} \sum_{t=1}^n \ell_t(\tilde{x}_t) - \min_{x \in \mathcal{K}} \mathbb{E} \sum_{t=1}^n \ell_t(x).$$

Note also that we state the theorem for a Legendre function $F$, but a
similar result can be obtained under the same assumptions as those of
Theorem 5.4.



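Theorem 5.5 (Pseudo-regret of OSMD). Let $\mathcal{K}$ be a compact
and convex set, $\mathcal{L}$ a set of subdifferentiable functions, and $F$ a Legendre
function defined on $\bar{\mathcal{D}} \supset \mathcal{K}$. If OSMD is run with a loss estimate
$\tilde{g}_t$ such that (5.3) is satisfied (with $\nabla \ell(x)$ replaced by $\tilde{g}_t$), and with
$\mathbb{E}[\tilde{g}_t \mid x_t] = \nabla \ell_t(x_t)$, then

$$\bar{R}_n \le \frac{\sup_{x \in \mathcal{K}} F(x) - F(x_1)}{\eta} + \frac{1}{\eta} \sum_{t=1}^n \mathbb{E}\, D_{F^*}\big(\nabla F(x_t) - \eta \tilde{g}_t, \nabla F(x_t)\big) + \sum_{t=1}^n \mathbb{E}\big[ \|x_t - \tilde{x}_t\|\, \|\tilde{g}_t\|_* \big]$$

for any norm $\|\cdot\|$. Moreover, if the loss is linear, that is $\ell(x) = \ell^\top x$,
then

$$\bar{R}_n \le \frac{\sup_{x \in \mathcal{K}} F(x) - F(x_1)}{\eta} + \frac{1}{\eta} \sum_{t=1}^n \mathbb{E}\, D_{F^*}\big(\nabla F(x_t) - \eta \tilde{g}_t, \nabla F(x_t)\big) + \sum_{t=1}^n \mathbb{E}\big[ \|x_t - \mathbb{E}[\tilde{x}_t \mid x_t]\|\, \|\tilde{g}_t\|_* \big].$$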

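Proof. Using Theorem 5.3 one directly obtains:

$$\sum_{t=1}^n \tilde{g}_t^\top (x_t - x) \le \frac{F(x) - F(x_1)}{\eta} + \frac{1}{\eta} \sum_{t=1}^n D_{F^*}\big(\nabla F(x_t) - \eta \tilde{g}_t, \nabla F(x_t)\big).$$

Moreover, since $\mathbb{E}[\tilde{g}_t \mid x_t] = \nabla \ell_t(x_t)$, one has:

$$\begin{aligned} \mathbb{E} \sum_{t=1}^n \big(\ell_t(\tilde{x}_t) - \ell_t(x)\big) &= \mathbb{E} \sum_{t=1}^n \big(\ell_t(\tilde{x}_t) - \ell_t(x_t) + \ell_t(x_t) - \ell_t(x)\big) \\ &\le \mathbb{E} \sum_{t=1}^n \|x_t - \tilde{x}_t\|\, \|\tilde{g}_t\|_* + \mathbb{E} \sum_{t=1}^n \nabla \ell_t(x_t)^\top (x_t - x) \\ &= \mathbb{E} \sum_{t=1}^n \|x_t - \tilde{x}_t\|\, \|\tilde{g}_t\|_* + \mathbb{E} \sum_{t=1}^n \tilde{g}_t^\top (x_t - x) \end{aligned}$$

which concludes the proof of the first regret bound. The case of a linear
loss follows very easily from the same computations. □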

5.4 Online combinatorial optimization 

In this section we consider an interesting special case of online linear
optimization. In the online combinatorial optimization setting the set of
arms is $\mathcal{C} \subseteq \{0, 1\}^d$ and the set of linear loss functions is $\mathcal{L} = [0, 1]^d$. We
assume $\|v\|_1 = m$ for all $v \in \mathcal{C}$ and for some integer $m \le d$. Many
interesting problems fall into this framework, including ranking/selection of
$m$ items, or path planning.

Here we focus on the version of the problem with semi-bandit feedback,
which is defined as follows: after playing $v_t \in \mathcal{C}$, one observes
$(\ell_t(1) v_t(1), \ldots, \ell_t(d) v_t(d))$. Namely, one only observes the coordinates
of the loss that were active in the arm $v_t$ that we chose. This setting
thus has a much weaker feedback than the full information case, but
still stronger than the bandit case. Note that the semi-bandit setting
includes the basic multi-armed bandit problem of Chapter 3, which simply
corresponds to $\mathcal{C} = \{e_1, \ldots, e_d\}$ where $e_1, \ldots, e_d$ is the canonical
basis of $\mathbb{R}^d$.

Again, the key to tackle this kind of problem is to select $v_t$ at random
from some probability distribution $p_t$ over $\mathcal{C}$. Note that such a
probability corresponds to an average point $x_t \in \mathrm{Conv}(\mathcal{C})$. Turning
the tables, one can view $v_t$ as a random perturbation of $x_t$ such that
$\mathbb{E}[v_t \mid x_t] = x_t$. This suggests a strategy: play OSMD on $\mathcal{K} = \mathrm{Conv}(\mathcal{C})$,
with $\tilde{x}_t = v_t$. Surprisingly, we show that this randomization is enough
to obtain a good unbiased estimate of the loss, and that it is not
necessary to add further perturbations to $x_t$. Note that $\mathbb{E}[\tilde{x}_t \mid x_t] = x_t$ by
definition. We now need to describe how to obtain an unbiased estimate
of the gradient (which is the loss vector itself, since losses are linear).
The following simple formula gives an unbiased estimate of the loss:

$$\tilde{\ell}_t(i) = \frac{\ell_t(i)\, v_t(i)}{x_t(i)} \qquad \forall i \in \{1, \ldots, d\}. \qquad (5.5)$$

Note that this is a valid estimate since it only makes use of
$(\ell_t(1) v_t(1), \ldots, \ell_t(d) v_t(d))$. Moreover, it is unbiased with respect to the
random drawing of $v_t$ from $p_t$. Indeed,

$$\mathbb{E}\big[\tilde{\ell}_t(i) \mid x_t\big] = \frac{\ell_t(i)}{x_t(i)}\, \mathbb{E}[v_t(i) \mid x_t] = \ell_t(i).$$

Using Theorem 5.5 one directly obtains:

$$\bar{R}_n \le \frac{\sup_{x \in \mathcal{K}} F(x) - F(x_1)}{\eta} + \frac{1}{\eta} \sum_{t=1}^n \mathbb{E}\, D_{F^*}\big(\nabla F(x_t) - \eta \tilde{\ell}_t, \nabla F(x_t)\big). \qquad (5.6)$$

We show now how to use this bound to obtain concrete performances
for OSMD using the negative entropy as Legendre function. Later, we
show that one can improve the results by logarithmic factors, using a
more subtle Legendre function.
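The unbiasedness of the estimate (5.5), and the variance-type quantity $x_t(i)\, \mathbb{E}[\tilde{\ell}_t(i)^2 \mid x_t]$ that drives the bounds in this section, can be computed exactly on a small instance. The combinatorial set, the distribution $p_t$, and the loss vector below are made-up values with $d = 4$ and $m = 2$.

```python
arms = [(1, 1, 0, 0), (0, 1, 1, 0), (0, 0, 1, 1), (1, 0, 0, 1)]   # C, with m = 2
p = [0.4, 0.3, 0.2, 0.1]                                          # p_t over C
loss = [0.9, 0.1, 0.5, 0.3]                                       # l_t in [0, 1]^d
d = 4

# Average point x_t = E[v_t], an element of Conv(C).
x = [sum(pk * v[i] for pk, v in zip(p, arms)) for i in range(d)]

def estimate(v):
    """Estimate (5.5): uses only the observed products l_t(i) * v_t(i)."""
    feedback = [loss[i] * v[i] for i in range(d)]
    return [feedback[i] / x[i] for i in range(d)]

# Exact conditional expectations over the draw of v_t from p_t.
mean = [sum(pk * estimate(v)[i] for pk, v in zip(p, arms)) for i in range(d)]
second = [x[i] * sum(pk * estimate(v)[i] ** 2 for pk, v in zip(p, arms))
          for i in range(d)]
```

Here `mean` equals `loss` coordinate by coordinate, and each entry of `second` equals $\ell_t(i)^2 \le 1$.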

We start with OSMD and the negative entropy. 



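Theorem 5.6 (OSMD with negative entropy). For any set
$\mathcal{C} \subseteq \{0, 1\}^d$, if OSMD is run on $\mathcal{K} = \mathrm{Conv}(\mathcal{C})$ with
$F(x) = \sum_{i=1}^d x_i \ln x_i - \sum_{i=1}^d x_i$, perturbed points $\tilde{x}_t$ such that
$\mathbb{E}[\tilde{x}_t \mid x_t] = x_t$, and non-negative loss estimates $\tilde{\ell}_t$, then

$$\bar{R}_n \le \frac{m \ln \frac{d}{m}}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \sum_{i=1}^d \mathbb{E}\big[ x_t(i)\, \tilde{\ell}_t(i)^2 \big].$$

In particular, with the estimate (5.5) and $\eta = \sqrt{2\, \frac{m}{nd} \ln \frac{d}{m}}$,

$$\bar{R}_n \le \sqrt{2 m d n \ln \frac{d}{m}}.$$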
V m 



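Proof. First note that:

$$F(x) - F(x_1) \le \sum_{i=1}^d x_1(i) \ln \frac{1}{x_1(i)} \le m \ln\left( \sum_{i=1}^d \frac{x_1(i)}{m} \frac{1}{x_1(i)} \right) = m \ln \frac{d}{m}.$$

Moreover, straightforward computations give

$$D_{F^*}\big(\nabla F(x_t) - \eta \tilde{\ell}_t, \nabla F(x_t)\big) = \sum_{i=1}^d x_t(i)\, \Theta\big(-\eta \tilde{\ell}_t(i)\big)$$

where $\Theta : x \in \mathbb{R} \mapsto \exp(x) - 1 - x$. Using that $\Theta(x) \le \frac{x^2}{2}$ for all $x \le 0$
concludes the proof of the first inequality (since $\tilde{\ell}_t(i) \ge 0$). The second
inequality follows from

$$x_t(i)\, \mathbb{E}\big[\tilde{\ell}_t(i)^2 \mid x_t\big] = x_t(i)\, \frac{\ell_t(i)^2}{x_t(i)^2}\, \mathbb{E}[v_t(i) \mid x_t] \le 1$$

where we used $\ell_t(i) \in [0, 1]$ and $v_t(i) \in \{0, 1\}$. □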
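For $m = 1$ (the simplex), OSMD with the negative entropy and the estimate (5.5) is short enough to write out in full. The sketch below is only an illustration under that assumption: the projection step is the normalization mentioned in Section 5.3, the learning rate is the one of Theorem 5.6 with $m = 1$, and the loss sequence, seed, and function name are hypothetical.

```python
import math
import random

def osmd_entropy_bandit(loss_rows, eta, seed=0):
    """OSMD on the simplex (m = 1) with negative entropy and estimate (5.5)."""
    rng = random.Random(seed)
    d = len(loss_rows[0])
    x = [1.0 / d] * d                          # x_1 minimizes F over the simplex
    total = 0.0
    for loss in loss_rows:
        i = rng.choices(range(d), weights=x)[0]    # v_t = e_i, so E[v_t | x_t] = x_t
        total += loss[i]
        est = [0.0] * d
        est[i] = loss[i] / x[i]                # only coordinate i is observed
        w = [xj * math.exp(-eta * ej) for xj, ej in zip(x, est)]
        s = sum(w)
        x = [wj / s for wj in w]               # entropic projection = normalization
    return total, x

n, d = 2000, 4
rows = [[0.2, 0.6, 0.6, 0.6] for _ in range(n)]        # arm 0 has the smallest loss
eta = math.sqrt(2 * math.log(d) / (n * d))             # Theorem 5.6 with m = 1
total, x_final = osmd_entropy_bandit(rows, eta)
```

The theorem bounds the pseudo-regret in expectation, so a single run such as this one is not guaranteed to stay below $\sqrt{2 d n \ln d}$; the code only shows the mechanics of the update.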

We now greatly generalize the negative entropy with the following
definition. When used with OSMD, this more general entropy allows us
to obtain a bound tighter than that of Theorem 5.6.



Definition 5.4. Let $\omega \ge 0$. A function $\psi : (-\infty, a) \to \mathbb{R}_+^*$ for some
$a \in \mathbb{R} \cup \{+\infty\}$ is called an $\omega$-potential if it is convex, continuously
differentiable, and satisfies

$$\lim_{x \to -\infty} \psi(x) = \omega, \qquad \lim_{x \to a} \psi(x) = +\infty,$$

$$\psi' > 0, \qquad \int_\omega^{\omega + 1} |\psi^{-1}(s)|\, ds < +\infty.$$

With a potential $\psi$ we associate the function $F_\psi$ defined on
$\mathcal{D} = (\omega, +\infty)^d$ by

$$F_\psi(x) = \sum_{i=1}^d \int_\omega^{x_i} \psi^{-1}(s)\, ds.$$

We restrict our attention to 0-potentials. A non-zero $\omega$ might be used
to derive high probability regret bounds (instead of pseudo-regret
bounds). Note that with $\psi(x) = e^x$ we have that $F_\psi$ reduces to the
negative entropy.
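The last remark can be verified by a one-line computation. The following worked derivation is a sketch whose intermediate steps are not spelled out in the text: for $\psi(x) = e^x$ one has $a = +\infty$, $\omega = 0$, and $\psi^{-1}(s) = \ln s$, which is integrable near $0$.

```latex
F_\psi(x) = \sum_{i=1}^d \int_0^{x_i} \ln s \, ds
          = \sum_{i=1}^d \Big[ s \ln s - s \Big]_{0}^{x_i}
          = \sum_{i=1}^d x_i \ln x_i - \sum_{i=1}^d x_i ,
\qquad \text{since } \lim_{s \to 0^+} s \ln s = 0 .
```

This is exactly the function $F$ used in Theorem 5.6, and the integrability condition of Definition 5.4 holds because $\int_0^1 |\ln s|\, ds = 1$.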



Lemma 5.3. Let $\psi$ be a 0-potential. Then $F_\psi$ is Legendre and for all
$u, v \in \mathcal{D}^* = (-\infty, a)^d$ such that $u_i \le v_i$ for $i = 1, \ldots, d$,

$$D_{F_\psi^*}(u, v) \le \frac{1}{2} \sum_{i=1}^d \psi'(v_i)\, (u_i - v_i)^2.$$

Proof. It is easy to check that $F = F_\psi$ is a Legendre function. Moreover,
since $\nabla F^*(u) = (\nabla F)^{-1}(u) = (\psi(u_1), \ldots, \psi(u_d))$ we obtain

$$D_{F^*}(u, v) = \sum_{i=1}^d \left( \int_{v_i}^{u_i} \psi(s)\, ds - (u_i - v_i)\, \psi(v_i) \right).$$

From a Taylor expansion, we have

$$D_{F^*}(u, v) \le \sum_{i=1}^d \max_{s \in [u_i, v_i]} \frac{1}{2} \psi'(s)\, (u_i - v_i)^2.$$

Since the function $\psi$ is convex, and $u_i \le v_i$, we have

$$\max_{s \in [u_i, v_i]} \psi'(s) \le \psi'\big(\max\{u_i, v_i\}\big) \le \psi'(v_i)$$

which gives the desired result. □



We are now ready to bound the pseudo-regret of OSMD run with an
arbitrary 0-potential. For a specific choice of the potential we obtain
an improvement of Theorem 5.6. In particular, for $m = 1$ this result
gives the log-free bound for the adversarial multi-armed bandit that
was discussed in Section 3.4.1.



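Theorem 5.7 (OSMD with a 0-potential). For any subset $\mathcal{C}$
of $\{0, 1\}^d$, if OSMD is run on $\mathcal{K} = \mathrm{Conv}(\mathcal{C})$ with $F_\psi$ defined by a
0-potential $\psi$, and non-negative loss estimates $\tilde{\ell}_t$, then

$$\bar{R}_n \le \frac{\sup_{x \in \mathcal{K}} F_\psi(x) - F_\psi(x_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \sum_{i=1}^d \mathbb{E}\left[ \frac{\tilde{\ell}_t(i)^2}{(\psi^{-1})'(x_t(i))} \right].$$

In particular, choosing the 0-potential $\psi(x) = (-x)^{-q}$ with $q > 1$, the
estimate (5.5), and $\eta = \sqrt{\frac{2}{q-1}\, \frac{m^{1 - 2/q}}{n\, d^{1 - 2/q}}}$,

$$\bar{R}_n \le q \sqrt{\frac{2}{q-1}\, m d n}.$$

With $q = 2$ this gives

$$\bar{R}_n \le 2\sqrt{2 m d n}.$$

Proof. First note that since $\mathcal{D}^* = (-\infty, a)^d$ and $\tilde{\ell}_t$ has non-negative
coordinates, then (5.3) is satisfied and thus OSMD is well defined.

The first inequality trivially follows from (5.6), Lemma 5.3, and the
fact that $\psi'(\psi^{-1}(s)) = \frac{1}{(\psi^{-1})'(s)}$.

Let $\psi(x) = (-x)^{-q}$. Then we have that $\psi^{-1}(x) = -x^{-1/q}$ and
$F(x) = -\frac{q}{q-1} \sum_{i=1}^d x_i^{1 - 1/q}$. In particular, by Hölder's inequality, since
$\sum_{i=1}^d x_1(i) = m$,

$$F(x) - F(x_1) \le \frac{q}{q-1} \sum_{i=1}^d x_1(i)^{1 - 1/q} \le \frac{q}{q-1}\, m^{(q-1)/q}\, d^{1/q}.$$

Moreover, note that $(\psi^{-1})'(x) = \frac{1}{q} x^{-1 - 1/q}$, and

$$\sum_{i=1}^d \mathbb{E}\left[ \frac{\tilde{\ell}_t(i)^2}{(\psi^{-1})'(x_t(i))} \,\middle|\, x_t \right] \le q \sum_{i=1}^d x_t(i)^{1/q} \le q\, m^{1/q}\, d^{1 - 1/q},$$

which ends the proof. □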
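The constants in Theorem 5.7 can be double-checked by combining the two estimates from the proof and using the learning rate that balances them, which scales as $1/\sqrt{n}$. The sketch below does this numerically; the function name `potential_bound` and the values of $n$, $d$, $m$ are illustrative assumptions.

```python
import math

def potential_bound(n, d, m, q, eta):
    """Bound of Theorem 5.7 for psi(x) = (-x)^(-q), using the estimates of the proof."""
    range_term = q / (q - 1) * m ** ((q - 1) / q) * d ** (1 / q)   # F(x) - F(x_1)
    variance_term = q * m ** (1 / q) * d ** (1 - 1 / q)            # per-round sum over i
    return range_term / eta + 0.5 * eta * n * variance_term

def balanced_eta(n, d, m, q):
    """Learning rate that equalizes the two terms of the bound."""
    return math.sqrt(2 / (q - 1) * m ** (1 - 2 / q) / (n * d ** (1 - 2 / q)))

n, d, m = 10_000, 20, 4
bound_q2 = potential_bound(n, d, m, 2, balanced_eta(n, d, m, 2))
bound_q3 = potential_bound(n, d, m, 3, balanced_eta(n, d, m, 3))
```

Under these assumptions `bound_q2` equals $2\sqrt{2mdn}$ and `bound_q3` equals $q\sqrt{2mdn/(q-1)}$ with $q = 3$, so the choice $q = 2$ gives the smaller constant of the two.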



5.5 Improved regret bounds for bandit feedback 

We go back to the setting of linear losses with bandit feedback considered
in Section 5.1. Namely, actions belong to a compact and convex
set $\mathcal{K} \subset \mathbb{R}^d$, losses belong to a subset $\mathcal{L} \subset \mathbb{R}^d$, and the loss of playing
$x_t \in \mathcal{K}$ at time $t$ is $x_t^\top \ell_t$, which is also the feedback received by the
player. As we proved in Section 5.1, under the bounded scalar loss
assumption, $|x^\top \ell| \le 1$ for all $(x, \ell) \in \mathcal{K} \times \mathcal{L}$, one can obtain a regret bound
of order $d\sqrt{n}$ (up to logarithmic factors) for any compact and convex
set $\mathcal{K}$. It can be shown that this rate is not improvable in general.
However, results from Section 5.4 (or from Chapter 3) show that for the
simplex, one can obtain a regret bound of order $\sqrt{dn}$, and we showed in
Chapter 3 that this rate is also unimprovable. Characterizing the sets
for which such improved regret bounds are possible is an open problem.
Improved rates can be obtained for
another convex body: the Euclidean ball. We now describe a strategy
that attains a pseudo-regret of order $\sqrt{dn}$ (up to a logarithmic factor)
in this case. The strategy is based on OSMD with a carefully chosen
Legendre function.

In the following, let $\|\cdot\|$ be the Euclidean norm. We consider the
online linear optimization problem with bandit feedback on the
Euclidean unit ball $\mathcal{K} = \{x \in \mathbb{R}^d : \|x\| \le 1\}$. We perform the following
perturbation of a point $x_t$ in the interior of $\mathcal{K}$,

$$\tilde{x}_t = \begin{cases} x_t / \|x_t\| & \text{if } \xi_t = 1, \\ \varepsilon_t e_{I_t} & \text{otherwise,} \end{cases}$$

where $\xi_t$ is a Bernoulli random variable of parameter $\|x_t\|$, $I_t$ is drawn
uniformly at random in $\{1, \ldots, d\}$, and $\varepsilon_t$ is a Rademacher random
variable with parameter $\frac{1}{2}$.

It is easy to check that this perturbation is unbiased, in the sense
that $\mathbb{E}[\tilde{x}_t \mid x_t] = x_t$. An unbiased estimate of the loss vector is given
by

$$\tilde{\ell}_t = d\, \frac{1 - \xi_t}{1 - \|x_t\|}\, \big(\ell_t^\top \tilde{x}_t\big)\, \tilde{x}_t. \qquad (5.7)$$

Again, it is easy to check that $\mathbb{E}[\tilde{\ell}_t \mid x_t] = \ell_t$. We are now ready
to prove the following result, showing that OSMD with a suitable $F$
achieves a pseudo-regret of order $\sqrt{dn \ln n}$ on the Euclidean ball.
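Both unbiasedness claims can be checked exactly by listing the $2d + 1$ possible outcomes of the perturbation together with their probabilities. The sketch below does so for a made-up interior point and loss vector in $\mathbb{R}^3$; the helper names are hypothetical.

```python
import math

def outcomes(x):
    """All (probability, played point, xi) triples of the perturbation of x."""
    d, norm = len(x), math.sqrt(sum(v * v for v in x))
    result = [(norm, [v / norm for v in x], 1)]             # xi_t = 1
    for i in range(d):                                      # xi_t = 0: play +/- e_i
        for sign in (1.0, -1.0):
            e = [sign if j == i else 0.0 for j in range(d)]
            result.append(((1 - norm) / (2 * d), e, 0))
    return result

def loss_estimate(x, played, xi, loss):
    """Estimate (5.7), built from the observed scalar loss^T played."""
    d, norm = len(x), math.sqrt(sum(v * v for v in x))
    observed = sum(a * b for a, b in zip(loss, played))
    scale = d * (1 - xi) / (1 - norm) * observed
    return [scale * v for v in played]

x, loss = [0.3, -0.4, 0.1], [0.5, 0.2, -0.6]
outs = outcomes(x)
mean_point = [sum(p * pt[i] for p, pt, _ in outs) for i in range(3)]
mean_est = [sum(p * loss_estimate(x, pt, xi, loss)[i] for p, pt, xi in outs)
            for i in range(3)]
```

Here `mean_point` recovers `x` and `mean_est` recovers `loss`, even though the estimate is identically zero whenever $\xi_t = 1$.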



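Theorem 5.8 (OSMD for the Euclidean ball). Let
$\mathcal{K} = \mathcal{L} = \{x \in \mathbb{R}^d : \|x\| \le 1\}$ define an online linear optimization
problem with bandit feedback. If OSMD is run on $\mathcal{K}' = (1 - \gamma)\mathcal{K}$ with
$F(x) = -\ln(1 - \|x\|) - \|x\|$ and the estimate (5.7), then for any $\eta > 0$
such that $\eta d \le \frac{1}{2}$,

$$\bar{R}_n \le \gamma n + \frac{\ln \gamma^{-1}}{\eta} + \eta \sum_{t=1}^n \mathbb{E}\big[ (1 - \|x_t\|)\, \|\tilde{\ell}_t\|^2 \big]. \qquad (5.8)$$

In particular, with $\gamma = \frac{1}{\sqrt{n}}$ and $\eta = \sqrt{\frac{\ln n}{2nd}}$,

$$\bar{R}_n \le 3\sqrt{dn \ln n}. \qquad (5.9)$$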



Proof. First, it is clear that by playing on $\mathcal{K}' = (1 - \gamma)\mathcal{K}$ instead of $\mathcal{K}$, 
OSMD incurs an extra $\gamma n$ regret. Second, note that $F$ is strictly convex 
(it is the composition of a convex and nondecreasing function with the 
Euclidean norm) and 

$$\nabla F(x) = \frac{x}{1 - \|x\|} . \tag{5.10}$$

In particular, $F$ is Legendre on the open unit ball $\mathcal{D} = \{x \in \mathbb{R}^d : \|x\| < 1\}$, and one has $\mathcal{D}^* = \mathbb{R}^d$. Hence (5.3) is always satisfied, and OSMD 
is well defined. Now the regret with respect to $\mathcal{K}'$ can be bounded as 
follows: using Theorem 5.5 and the unbiasedness of $\tilde{x}_t$ and $\tilde{\ell}_t$ we get the upper bound 

$$\frac{\sup_{x \in \mathcal{K}'} F(x) - F(x_1)}{\eta} + \frac{1}{\eta} \sum_{t=1}^{n} \mathbb{E}\, D_{F^*}\big( \nabla F(x_t) - \eta \tilde{\ell}_t, \nabla F(x_t) \big) .$$

The first term is clearly bounded by $\frac{1}{\eta} \ln \frac{1}{\gamma}$ (since $x_1 = 0$). For the 
second term, we need to do a few computations. The first one follows 
from (5.10), the others follow from simple algebra: 

$$\nabla F^*(u) = \frac{u}{1 + \|u\|} ,$$

$$F^*(u) = -\ln(1 + \|u\|) + \|u\| ,$$

$$D_{F^*}(u, v) = \frac{1}{1 + \|v\|} \left( \|u\| - \|v\| + \|u\| \cdot \|v\| - v^\top u - (1 + \|v\|) \ln\left( 1 + \frac{\|u\| - \|v\|}{1 + \|v\|} \right) \right) .$$

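Let $\Theta(u, v)$ be such that $D_{F^*}(u, v) = \frac{1}{1 + \|v\|} \Theta(u, v)$. First note that 

$$\frac{1}{1 + \|\nabla F(x_t)\|} = 1 - \|x_t\| . \tag{5.11}$$

Thus, in order to prove (5.8) it remains to show that $\Theta(u, v) \le \|u - v\|^2$, for $u = \nabla F(x_t) - \eta \tilde{\ell}_t$ and $v = \nabla F(x_t)$. In fact, we prove 
that this inequality holds as soon as $\frac{\|u\| - \|v\|}{1 + \|v\|} \ge -\frac{1}{2}$. This is the case 
for the pair $(u, v)$ under consideration, since by the triangle inequality, 
equations (5.7) and (5.11), and the assumption on $\eta$, 

$$\frac{\|u\| - \|v\|}{1 + \|v\|} \ge - \frac{\eta \|\tilde{\ell}_t\|}{1 + \|v\|} \ge - \eta d \ge - \frac{1}{2} .$$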



Now using that $\ln(1 + x) \ge x - x^2$ for all $x \ge -\frac{1}{2}$, we obtain that for 
$u, v$ such that $\frac{\|u\| - \|v\|}{1 + \|v\|} \ge -\frac{1}{2}$, 

$$\begin{aligned}
\Theta(u, v) &\le \frac{(\|u\| - \|v\|)^2}{1 + \|v\|} + \|u\| \cdot \|v\| - v^\top u \\
&\le (\|u\| - \|v\|)^2 + \|u\| \cdot \|v\| - v^\top u \\
&= \|u\|^2 + \|v\|^2 - \|u\| \cdot \|v\| - v^\top u \\
&= \|u - v\|^2 + 2 v^\top u - \|u\| \cdot \|v\| - v^\top u \\
&\le \|u - v\|^2 ,
\end{aligned}$$

which concludes the proof of (5.8). For the proof of (5.9) it suffices to 
note that 

$$\mathbb{E}\left[ (1 - \|x_t\|) \, \|\tilde{\ell}_t\|^2 \right] = (1 - \|x_t\|) \sum_{i=1}^{d} \frac{1 - \|x_t\|}{d} \, \frac{d^2}{(1 - \|x_t\|)^2} \, (\ell_t^\top e_i)^2 = d \|\ell_t\|^2 \le d ,$$

and perform straightforward computations. 



□ 
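For concreteness, the final computation can be sketched as follows. This is an illustrative calculation rather than the authors' omitted steps; it assumes the choice $\gamma = 1/\sqrt{n}$ and $\eta = \sqrt{\ln n / (2nd)}$, together with the bound $\mathbb{E}[(1 - \|x_t\|)\|\tilde{\ell}_t\|^2] \le d$ obtained in the proof. Substituting into (5.8) gives

$$\gamma n + \frac{\ln \gamma^{-1}}{\eta} + \eta n d = \sqrt{n} + \frac{\ln n}{2} \sqrt{\frac{2nd}{\ln n}} + \sqrt{\frac{nd \ln n}{2}} = \sqrt{n} + \sqrt{2 dn \ln n} \le (1 + \sqrt{2}) \sqrt{dn \ln n} ,$$

where the last inequality uses $d \ln n \ge 1$, which holds for $n \ge 3$. With this $\eta$, the requirement $\eta d \le \frac{1}{2}$ is equivalent to $n \ge 2 d \ln n$.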



5.6 Refinements and bibliographic remarks 

Online convex optimization in the full information setting was 
introduced by Zinkevich [2003]. Online linear optimization with 
bandit feedback was pioneered in Awerbuch and Kleinberg [2004], 
McMahan and Blum [2004]. For this problem, Dani et al. [2008a] were 
the first to obtain optimal $O(\sqrt{n})$ bounds in terms of the number $n$ of 
rounds. This was done using the Exp2 strategy with an exploration 
uniform over a barycentric spanner for $\mathcal{K}$. The exploration part was first 
improved by Cesa-Bianchi and Lugosi [2012] for combinatorial sets $\mathcal{K}$. 
Finally, the optimal exploration based on John's theorem was 
introduced by Bubeck et al. [2012a]. Theorem 5.2 is extracted from this last 
paper. 

Simultaneously with the line of research on Exp2, algorithms based 
on Online Mirror Descent were also investigated. Mirror Descent 
was originally introduced in the seminal work of Nemirovski [1979], 
Nemirovski and Yudin [1983] as a standard (offline) convex optimization 
method. Related algorithms were also studied in the online learning 
community, see Herbster and Warmuth [1998], Grove et al. [2001], 
Kivinen and Warmuth [2001], Shalev-Shwartz [2007]. The connection 
between existing online learning algorithms (such as Exponential weights 
or Online Gradient Descent) and Mirror Descent was first made explicit in 
Cesa-Bianchi and Lugosi [2006]; see also Rakhlin [2009] and Hazan [2011]. 
Earlier applications of Mirror Descent in the learning community can be 
found in Juditsky et al. [2005]. 
The first application of Mirror Descent to online linear optimization 
with bandit feedback was given by Abernethy et al. [2008]. In this 
pioneering paper, the authors describe the first computationally efficient 
strategy (i.e., with complexity polynomial in $d$) with $O(\sqrt{n})$ regret. 
The main idea is to use Mirror Descent with a self-concordant barrier 
$F$ for the set $\mathcal{K}$. Unfortunately, the drawback is a suboptimal 
dependency on $d$ in the regret. More precisely, they obtain a $O(d^2 \sqrt{n})$ regret 
under the bounded scalar loss assumption, while Exp2 with John's 
exploration attains $O(d \sqrt{n})$. However, Mirror Descent can also deliver 
optimal regret bounds in the bandit case, as we showed in Section 5.5, 
which is extracted from Bubeck et al. [2012a]. 



The presentation of the Online Mirror Descent algorithm in 
Section 5.2 is inspired by Bubeck [2011]. The definition of Legendre 
functions comes from Cesa-Bianchi and Lugosi [2006, Chapter 11]; 
further developments on convex analysis can be found in 
Hiriart-Urruty and Lemaréchal [2001], Boyd and Vandenberghe [2004]. 
Theorem [...] is taken from Audibert et al. [2011], but the proof 
technique goes back at least to Ben-Tal and Nemirovski [1999]. The 
proof of Theorem [...] [...] [2012]. Section 5.3 is inspired by 
gradient-free optimization, a topic extensively 
studied since Robbins and Monro [1951], Kiefer and Wolfowitz [1952]; 
see Nemirovski et al. [2009], Conn et al. [2009], Nesterov [2011], 
Bach and Moulines [2011] for recent accounts on this theory. 
Alternative views have been proposed on the Online Mirror Descent strategy. 
In particular, it is equivalent to a Follow The Regularized Leader, and 
to proximal algorithms, see Rakhlin [2009]. This viewpoint was 
pioneered by Beck and Teboulle [2003]; see also Bartok et al. [2011] for 
more details. Finally, a notion of universality of Online Mirror Descent 
in the online prediction setting was proposed by Srebro et al. [2011]. 



The online combinatorial optimization problem studied in 
Section 5.4 was introduced by Kalai and Vempala [2005] for the 
full information setting. Several works have studied this problem 
for specific sets $\mathcal{C}$, see in particular Takimoto and Warmuth [2003], 
Warmuth and Kuzmin [2008], Helmbold and Warmuth [2009], 
Hazan et al. [2010], Koolen et al. [2010], Warmuth et al. [2011], 
Cesa-Bianchi and Lugosi [2012]. The semi-bandit feedback was 
studied in the series of papers György et al. [2007], Kale et al. [2010], 
Uchiya et al. [2010], Audibert et al. [2011]. The presentation adopted 
in this section is based on the last paper. OSMD with negative 
entropy was first studied by Helmbold and Warmuth [2009] for the full 
information setting and for a specific set $\mathcal{C}$. It was then studied more 
generally in Koolen et al. [2010] for any set $\mathcal{C}$. The generalization to 
semi-bandit feedback was done by Audibert et al. [2011]. OSMD with 
a Legendre function derived from a potential was introduced by Audibert et al. 
[2011]. In the case of the simplex, this strategy corresponds to the INF 
strategy of Audibert and Bubeck [2009] discussed in Section 3.4.1. 

Online linear optimization is still far from being completely 
understood. For instance, see Bubeck [2011, Chapter 9] for a list of open 
problems. In this section we also omitted a few important topics 
related to online linear optimization. We briefly review some of them 
below. 



5.6.1 Lower bounds 

Under the bounded scalar loss assumption, it was proved by Dani et al. 
[2008a] that for $\mathcal{K} = [-1, 1]^d$ the minimax regret in the full information 
setting is at least of order $\sqrt{dn}$, while under bandit feedback it is of 
order $d \sqrt{n}$. In both cases Exp2 matches these lower bounds (using 
John's exploration in the bandit case). 

In the combinatorial setting, where $\mathcal{K} \subset \{0, 1\}^d$ and $\mathcal{L} = [0, 1]^d$, 
Audibert et al. [2011] show that the minimax regret in the full 
information and semi-bandit cases is at least of order $d \sqrt{n}$, while in the bandit 
case it is of order $d^{3/2} \sqrt{n}$. OSMD with the negative entropy matches 
the bounds in the full information and semi-bandit cases. However, in 
the bandit case the best known bound is obtained by Exp2 (with John's 
exploration) and gives a regret of order $d^2 \sqrt{n}$. It is important to remark 
that Audibert et al. [2011] show that Exp2 is a provably suboptimal 
strategy in the combinatorial setting. 

Finally, lower bounds for the full information case, and for a few 
specific sets $\mathcal{K}$ of interest, were derived by Koolen et al. [2010]. 



5.6.2 High probability bounds 

In this chapter we focused on the pseudo-regret $\overline{R}_n$. However, just like 
in Chapter 3, a much more important and interesting statement 
concerns high probability bounds for the regret $R_n$. Partial results in this 
direction can be found in Bartlett et al. [2008] for the Exp2 strategy, 
and in Abernethy and Rakhlin [2009] for the OSMD algorithm. 



5.6.3 Stochastic online linear optimization 

Similarly to the stochastic bandit case (see Chapter 2), a natural 
restriction to consider for the adversary is that the sequence of losses 
$\ell_1, \ell_2, \dots$ is an i.i.d. sequence. This stochastic setting was introduced by 
Auer [2002], and further studied by Dani et al. [2008b]. In particular, 
in the latter paper it was proved that regrets logarithmic in $n$ and 
polynomial in $d$ are possible, as long as $\mathcal{K}$ is a polytope. Recent progress on 
this problem can be found in Rusmevichientong and Tsitsiklis [2010], 
Filippi et al. [2010], Abbasi-Yadkori et al. [2011]. 

6 Nonlinear bandits 

We now extend the analysis of the previous chapter to the following 
scenario: arms are still points in a convex set $\mathcal{K} \subset \mathbb{R}^d$, but now losses 
are not necessarily linear functions of the arms. More precisely, the 
adversary selects loss functions from some set $\mathcal{L}$ of real-valued functions 
defined on $\mathcal{K}$. The pseudo-regret is then defined as: 

$$\overline{R}_n = \mathbb{E} \sum_{t=1}^{n} \ell_t(x_t) - \min_{x \in \mathcal{K}} \mathbb{E} \sum_{t=1}^{n} \ell_t(x) .$$

This modification has important consequences. For instance, with 
strictly convex losses one has to do local perturbations in order to 
estimate the loss gradient; this is in contrast to the global 
perturbations studied in the previous chapter. In agreement with the setting of 
Chapter 5, we initially focus on the nonstochastic setting, where the 
forecaster faces an unknown sequence of convex Lipschitz and 
differentiable losses (in the nonlinear case the regret scales with the Lipschitz 
constant of losses). Problems of this kind can be viewed as dynamic 
variants of convex optimization problems, in which the convex 
function to optimize evolves over time. The bandit constraint can be 
simply interpreted as the impossibility of computing gradients (because, 
for instance, we do not have an explicit representation of the function, 
but it can only be accessed by querying for values at desired points). 

We look at two feedback models. In the first one, at each step the 
forecaster evaluates the loss function at two points: the played point 
plus an additional point of its choice. In the second one, only the value 
of the loss evaluated at the played point is made available to the 
forecaster. We show that while the two-points model allows for a $O(\sqrt{n})$ 
bound on pseudo-regret, in the one-point model a pseudo-regret bound 
of order $n^{3/4}$ is achieved. The stochastic setting is investigated in 
Section 6.3 where, similarly to Chapter 2, we assume that each play 
of an arm returns a stochastic loss with fixed but unknown mean. 
Unlike the nonstochastic case, the mean loss function is assumed to be 
Lipschitz and unimodal, but not necessarily convex. To keep things 
simple, the stochastic setting is studied in the 1-dimensional case, when 
arms are points in the unit interval. For this case we show a bound on 
the pseudo-regret of $O(\sqrt{n} \, (\log n))$. 



6.1 Two-points bandit feedback 

We start by analyzing the nonstochastic case in the two-point feedback 
model: at each time step $t$, the forecaster observes the value of a convex 
and differentiable loss function $\ell_t$ at the played point $x_t$ and at an extra 
point $x'_t$ of its choice. If the second point is chosen at random in a 
neighborhood of the first one, one can use it to compute an estimate 
of the gradient of $\ell_t$ at $x_t$. Hence, running OSMD on the estimated 
gradients we obtain a regret bound controlled by the second moments 
of these estimates. The algorithm we present in this section follows this 
intuition, although (for technical reasons) the gradient is estimated 
at a point which is close but distinct from the point actually played. 

We focus our analysis on OSMD with Legendre function $F = \frac{1}{2} \|\cdot\|^2$, 
where $\|\cdot\|$ is the Euclidean norm. The resulting strategy, 
Online Stochastic Gradient Descent (OSGD), is sketched below. 

OSGD (Online Stochastic Gradient Descent): 

Parameters: Closed and convex set $\mathcal{K} \subseteq \mathbb{R}^d$, learning rate $\eta > 0$. 
Initialize: $x_1 = (0, \dots, 0)$. 

For each round $t = 1, 2, \dots, n$ 

(1) Observe stochastic estimate $\tilde{g}_t(x_t)$ of $\nabla \ell_t(x_t)$; 

(2) $x'_{t+1} = x_t - \eta \, \tilde{g}_t(x_t)$; 

(3) $x_{t+1} = \operatorname{argmin}_{y \in \mathcal{K}} \|y - x'_{t+1}\|$; 

We now introduce our main technical tool: the two-point gradient 
estimate. The two points on which the loss value is queried at time 
$t$ are denoted by $X_t^+$ and $X_t^-$. OSGD always plays one of these two 
points at random. 

Let $\mathbb{B} = \{x \in \mathbb{R}^d : \|x\| \le 1\}$ be the unit ball in $\mathbb{R}^d$ and 
$\mathbb{S} = \{x \in \mathbb{R}^d : \|x\| = 1\}$ be the unit sphere. Fix $\delta > 0$ and introduce the 
notations $X_t^+ = x_t + \delta S$ and $X_t^- = x_t - \delta S$, where $x_t \in \mathcal{K}$ and $S$ is a 
random variable with uniform distribution in $\mathbb{S}$. Then, for any convex 
loss $\ell_t$, the two-point gradient estimate $\tilde{g}_t$ is defined by 

$$\tilde{g}_t(x_t) = \frac{d}{2\delta} \big( \ell_t(X_t^+) - \ell_t(X_t^-) \big) S . \tag{6.1}$$
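A minimal Python sketch of OSGD driven by the two-point estimate (6.1) may help fix ideas. It assumes, purely for illustration, that $\mathcal{K}$ is a Euclidean ball of radius `radius` centered at the origin (so that the projection is a rescaling and the inner and outer radii coincide), and that losses are given as Python callables. The names `sample_sphere`, `project_ball` and `osgd_two_point` are hypothetical.

```python
import math
import random


def sample_sphere(d, rng):
    """Draw a point uniformly on the unit sphere by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    length = math.sqrt(sum(c * c for c in g))
    return [c / length for c in g]


def project_ball(y, radius):
    """Euclidean projection onto the centered ball of the given radius."""
    length = math.sqrt(sum(c * c for c in y))
    return list(y) if length <= radius else [c * radius / length for c in y]


def osgd_two_point(losses, d, eta, delta, radius=1.0, seed=0):
    """Run OSGD with two-point feedback; return the last iterate and the played points."""
    rng = random.Random(seed)
    shrunk_radius = radius - delta  # (1 - delta/r) * K with r = R = radius
    x = [0.0] * d
    played = []
    for loss in losses:
        s = sample_sphere(d, rng)
        x_plus = [a + delta * b for a, b in zip(x, s)]
        x_minus = [a - delta * b for a, b in zip(x, s)]
        # The played point is one of the two queried points, chosen at random.
        played.append(x_plus if rng.random() < 0.5 else x_minus)
        # Two-point gradient estimate (6.1).
        coef = d / (2.0 * delta) * (loss(x_plus) - loss(x_minus))
        grad = [coef * b for b in s]
        # Gradient step followed by projection on the shrunken set.
        x = project_ball([a - eta * g for a, g in zip(x, grad)], shrunk_radius)
    return x, played
```

Running the iterates on the shrunken ball guarantees that both queried points stay inside $\mathcal{K}$, which is the reason for the shrinking step discussed later in this section.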

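In order to compute the expectation of $\tilde{g}_t$, first note that by symmetry 

$$\mathbb{E}\, \tilde{g}_t(x) = \frac{d}{\delta} \, \mathbb{E}\big[ \ell_t(x + \delta S) S \big] .$$

In order to compute the expectation in the right-hand side we need the 
following preliminary lemma. 

Lemma 6.1. For any differentiable function $\ell : \mathbb{R}^d \to \mathbb{R}$, 

$$\nabla \int_{\mathbb{B}} \ell(x + \delta b) \, db = \frac{1}{\delta} \int_{\mathbb{S}} \ell(x + \delta s) \, s \, d\sigma(s) ,$$

where $\sigma$ is the unnormalized spherical measure. 

Proof. The proof of this result is an easy consequence of the Divergence 
Theorem, 

$$\begin{aligned}
\nabla \int_{\mathbb{B}} \ell(x + \delta b) \, db &= \int_{\mathbb{B}} \nabla \ell(x + \delta b) \, db \\
&= \int_{\mathbb{S}} \frac{1}{\delta} \, \ell(x + \delta s) \, s \, d\sigma(s) \\
&= \frac{1}{\delta} \int_{\mathbb{S}} \ell(x + \delta s) \, s \, d\sigma(s) .
\end{aligned}$$

□ 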



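We are now fully equipped to compute the expectation of $\tilde{g}_t$. 

Lemma 6.2. If $B$ is a random variable with uniform distribution in $\mathbb{B}$ 
and $S$ is a random variable with uniform distribution in $\mathbb{S}$, then for all 
differentiable functions $\ell : \mathbb{R}^d \to \mathbb{R}$, 

$$\frac{d}{\delta} \, \mathbb{E}\big[ \ell(x + \delta S) S \big] = \nabla \mathbb{E}\, \ell(x + \delta B) .$$

Proof. First consider the easy one-dimensional case. Namely, $\mathcal{K} = [a, b]$ 
for some reals $a < b$. Note that, in this case, $S$ is uniform in $\{-1, +1\}$ 
whereas $B$ is uniform in $[-1, +1]$. Then 

$$\mathbb{E}\, \ell(x + \delta B) = \frac{1}{2\delta} \int_{-\delta}^{\delta} \ell(x + \varepsilon) \, d\varepsilon = \frac{L(x + \delta) - L(x - \delta)}{2\delta}$$

by the fundamental theorem of calculus, where $L$ is the antiderivative 
of $\ell$ satisfying $L' = \ell$. This gives 

$$\frac{d}{dx} \, \mathbb{E}\, \ell(x + \delta B) = \frac{\ell(x + \delta) - \ell(x - \delta)}{2\delta} .$$

On the other hand, 

$$\frac{1}{\delta} \, \mathbb{E}\big[ \ell(x + \delta S) S \big] = \frac{\ell(x + \delta) - \ell(x - \delta)}{2\delta} .$$

Hence $\frac{1}{\delta} \mathbb{E}[\ell(x + \delta S) S] = \frac{d}{dx} \mathbb{E}\, \ell(x + \delta B)$ and the 1-dimensional case is 
established. Note that the equivalence we just proved relates an 
integral over the unit sphere $\mathbb{S}$ to an integral over the unit ball $\mathbb{B}$. In $d$ 
dimensions, Lemma 6.1 delivers the corresponding generalized identity 

$$\frac{1}{\delta} \int_{\mathbb{S}} \ell(x + \delta s) \, s \, d\sigma(s) = \nabla \int_{\mathbb{B}} \ell(x + \delta b) \, db .$$

Now, since $\mathrm{Vol}(\mathbb{S}) = d \, \mathrm{Vol}(\mathbb{B})$ we immediately obtain 

$$\frac{d}{\delta} \, \mathbb{E}\big[ \ell(x + \delta S) S \big] = \frac{d}{\delta \, \mathrm{Vol}(\mathbb{S})} \int_{\mathbb{S}} \ell(x + \delta s) \, s \, d\sigma(s) = \frac{1}{\mathrm{Vol}(\mathbb{B})} \, \nabla \int_{\mathbb{B}} \ell(x + \delta b) \, db = \nabla \mathbb{E}\, \ell(x + \delta B) ,$$

concluding the proof. 

□ 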
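The one-dimensional identity can be verified on a concrete example. The sketch below is only an illustration: it takes $\ell(x) = x^3$ with antiderivative $L(x) = x^4/4$, and the helper names `smoothed_loss`, `sphere_side` and `numerical_derivative` are hypothetical. It compares the sphere-side expression with a finite-difference derivative of the smoothed loss.

```python
def smoothed_loss(antiderivative, x, delta):
    """E[l(x + delta*B)] for B uniform on [-1, 1], computed from the antiderivative L."""
    return (antiderivative(x + delta) - antiderivative(x - delta)) / (2.0 * delta)


def sphere_side(loss, x, delta):
    """(1/delta) * E[l(x + delta*S) * S] for S uniform on {-1, +1} (case d = 1)."""
    return (loss(x + delta) - loss(x - delta)) / (2.0 * delta)


def numerical_derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def check_identity(x=0.7, delta=0.1):
    """Return both sides of the 1-dimensional identity for l(x) = x**3."""
    loss = lambda z: z ** 3
    antiderivative = lambda z: z ** 4 / 4.0
    lhs = sphere_side(loss, x, delta)
    rhs = numerical_derivative(lambda z: smoothed_loss(antiderivative, z, delta), x)
    return lhs, rhs
```

For this choice of loss the smoothed function is $x^3 + \delta^2 x$, so both sides equal $3x^2 + \delta^2$.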



We have thus established $\mathbb{E}\, \tilde{g}_t(x) = \nabla \mathbb{E}\, \ell_t(x + \delta B)$, showing that $\tilde{g}_t$ 
provides an unbiased estimate of the gradient of a smoothed version 
$\hat{\ell}_t(x) = \mathbb{E}\, \ell_t(x + \delta B)$ of the loss function $\ell_t$. 

We can measure how well $\hat{\ell}_t$ approximates $\ell_t$ by exploiting the 
Lipschitz assumption, 

$$\begin{aligned}
\big| \ell_t(x) - \hat{\ell}_t(x) \big| &= \big| \ell_t(x) - \mathbb{E}\, \ell_t(x + \delta B) \big| \\
&\le \mathbb{E} \big| \ell_t(x) - \ell_t(x + \delta B) \big| \\
&\le \delta G \, \mathbb{E} \|B\| \\
&\le \delta G . 
\end{aligned} \tag{6.2}$$

The next lemma relates the regret under the losses $\ell_t$ to the regret under 
their smoothed versions $\hat{\ell}_t$. An additional issue taken into account by 
the lemma is that OSGD might play a point close to the boundary of 
the set $\mathcal{K}$. In this case the perturbed point on which the gradient is 
estimated could potentially be outside of $\mathcal{K}$. In order to prevent this 
from happening we need to run OSGD on a shrunken set $(1 - \xi)\mathcal{K}$. 

Lemma 6.3. Let $\mathcal{K} \subseteq \mathbb{R}^d$ be a convex set such that $\mathcal{K} \subseteq R\mathbb{B}$ for some 
$R \ge 0$, and fix $0 \le \xi \le 1$. For any sequence $\ell_1, \ell_2, \dots$ of $G$-Lipschitz 
differentiable and convex losses, for any sequence $x_1, x_2, \dots \in (1 - \xi)\mathcal{K} \subseteq \mathbb{R}^d$, 
and for any $x \in \mathcal{K}$, the following holds 

$$\frac{1}{2} \sum_{t=1}^{n} \big( \ell_t(X_t^+) + \ell_t(X_t^-) \big) - \sum_{t=1}^{n} \ell_t(x) \le \sum_{t=1}^{n} \hat{\ell}_t(x_t) - \sum_{t=1}^{n} \hat{\ell}_t\big( (1 - \xi) x \big) + 3 \delta G n + \xi G R n$$

for all realizations of the random process $(X_t^+, X_t^-)_{t \ge 1}$. 

Proof. Using the Lipschitzness of $\ell_t$ and (6.2) we obtain 

$$\begin{aligned}
\frac{1}{2} \big( \ell_t(X_t^+) + \ell_t(X_t^-) \big) - \ell_t(x) &\le \ell_t(x_t) + \delta G \|S\| - \ell_t(x) \\
&\le \ell_t(x_t) - \ell_t\big( (1 - \xi) x \big) + \delta G + \xi G R \\
&\le \hat{\ell}_t(x_t) - \hat{\ell}_t\big( (1 - \xi) x \big) + 3 \delta G + \xi G R . 
\end{aligned}$$

In the second step we used $\ell_t\big( (1 - \xi) x \big) - \ell_t(x) \le \xi G \|x\| \le \xi G R$, which results 
from the Lipschitzness of $\ell_t$ and the assumption $\mathcal{K} \subseteq R\mathbb{B}$. Summing over $t = 1, \dots, n$ 
gives the claim. □ 

Next, we show that the second moment of $\tilde{g}_t$ can be controlled by 
exploiting the Lipschitzness of $\ell_t$. In particular, 

$$\|\tilde{g}_t(x)\| = \frac{d}{2\delta} \big| \ell_t(x + \delta S) - \ell_t(x - \delta S) \big| \, \|S\| \le \frac{G d}{2\delta} \|2 \delta S\| = G d .$$

We are now ready to prove the main result of this section. Namely, 
that the pseudo-regret of OSGD run using the gradient estimate (6.1) 
is of order $\sqrt{n}$. We assume that the point $X_t$ played by OSGD at each 
time $t$ is randomly drawn between the two points $X_t^+$ and $X_t^-$ where 
the loss function is queried. 

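Theorem 6.1 (Regret of OSGD with two-points feedback). 

Let $\mathcal{K} \subseteq \mathbb{R}^d$ be a closed convex set such that $r\mathbb{B} \subseteq \mathcal{K} \subseteq R\mathbb{B}$ for some 
$r, R > 0$. Let $\mathcal{L}$ be a set of $G$-Lipschitz differentiable and convex losses. 
Fix $\delta > 0$ and assume OSGD is run on $\left(1 - \frac{\delta}{r}\right)\mathcal{K}$ with learning rate 
$\eta > 0$ and gradient estimates (6.1), 

$$\tilde{g}_t(x_t) = \frac{d}{2\delta} \big( \ell_t(X_t^+) - \ell_t(X_t^-) \big) S_t ,$$

where $S_1, S_2, \dots \in \mathbb{S}$ are independent. For each $t = 1, 2, \dots$ let $X_t$ be 
drawn at random between $X_t^+$ and $X_t^-$. Then the following holds 

$$\overline{R}_n \le \frac{R^2}{\eta} + \eta (G d)^2 n + \delta \left( 3 + \frac{R}{r} \right) G n .$$

Moreover, if $\eta = \frac{R}{G d \sqrt{n}}$ then for $\delta \to 0$ we have that 

$$\overline{R}_n \le 2 R G d \sqrt{n} .$$

Proof. First of all, we must check that the points $X_t^+ = x_t + \delta S$ and 
$X_t^- = x_t - \delta S$ on which $\ell_t$ is queried belong to $\mathcal{K}$. To see this, recall 
that $x_t \in \left(1 - \frac{\delta}{r}\right)\mathcal{K}$. Now, setting $\alpha = \frac{\delta}{r}$ we have that 
$X_t^+, X_t^- \in (1 - \alpha)\mathcal{K} + \alpha r \mathbb{S}$. Since $r\mathbb{S} \subseteq \mathcal{K}$ and $\mathcal{K}$ is convex, we obtain 
$(1 - \alpha)\mathcal{K} + \alpha r \mathbb{S} \subseteq (1 - \alpha)\mathcal{K} + \alpha \mathcal{K} \subseteq \mathcal{K}$. Hence, using Lemma 6.3 with the choice $\xi = \frac{\delta}{r}$ we 
immediately get that for all $x \in \mathcal{K}$, 

$$\begin{aligned}
\sum_{t=1}^{n} \mathbb{E}\big( \ell_t(X_t) \mid X_t^+, X_t^- \big) - \sum_{t=1}^{n} \ell_t(x) &= \frac{1}{2} \sum_{t=1}^{n} \big( \ell_t(X_t^+) + \ell_t(X_t^-) \big) - \sum_{t=1}^{n} \ell_t(x) \\
&\le \sum_{t=1}^{n} \hat{\ell}_t(x_t) - \sum_{t=1}^{n} \hat{\ell}_t\left( \left( 1 - \frac{\delta}{r} \right) x \right) + \delta \left( 3 + \frac{R}{r} \right) G n . 
\end{aligned}$$

Since we already related the loss of $X_t$ to the loss of $x_t$, we can now 
apply Theorem 5.5 in the special case of $\tilde{x}_t = x_t$ and with the sequence 
of losses $(\hat{\ell}_t)$. This gives 

$$\mathbb{E} \sum_{t=1}^{n} \hat{\ell}_t(x_t) - \sum_{t=1}^{n} \hat{\ell}_t\left( \left( 1 - \frac{\delta}{r} \right) x \right) \le \frac{R^2}{\eta} + \eta \sum_{t=1}^{n} \mathbb{E} \|\tilde{g}_t(x_t)\|^2 \le \frac{R^2}{\eta} + \eta (G d)^2 n ,$$

where we overapproximated $\left\| \left( 1 - \frac{\delta}{r} \right) \mathcal{K} \right\| \le \|\mathcal{K}\| = R$. This concludes 
the proof. □ 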
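As a quick check of the tuning, assume the bound has the form $\frac{R^2}{\eta} + \eta (G d)^2 n + \delta \left( 3 + \frac{R}{r} \right) G n$ and take $\eta = \frac{R}{G d \sqrt{n}}$. The first two terms are then balanced:

$$\frac{R^2}{\eta} = R G d \sqrt{n} , \qquad \eta (G d)^2 n = R G d \sqrt{n} ,$$

so that $\overline{R}_n \le 2 R G d \sqrt{n} + \delta \left( 3 + \frac{R}{r} \right) G n$, and the last term vanishes as $\delta \to 0$.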

6.2 One-point bandit feedback 

Building on the analysis of the previous section, it is not hard to show 
that the pseudo-regret can be bounded even when the loss function at 
each time $t$ is queried in only one point. However, we pay for this reduced 
bandit feedback with a worse rate of $n^{3/4}$ in the pseudo-regret bound. 
It is not known if this rate is optimal, or if it is possible to get a $\sqrt{n}$ 
regret as in the two-points setting. 

The one-point estimate at time $t$ is defined by 

$$\tilde{g}_t(x) = \frac{d}{\delta} \, \ell_t(x + \delta S) \, S \tag{6.3}$$

where $S$ is drawn at random from $\mathbb{S}$. Obviously, Lemma 6.2 can be 
applied to get $\mathbb{E}\, \tilde{g}_t(x) = \nabla \hat{\ell}_t(x)$ where, we recall, $\hat{\ell}_t(x) = \mathbb{E}\, \ell_t(x + \delta B)$. 
Differences with the two-point case arise when we bound the second 
moment of this new $\tilde{g}_t$. Indeed, if $x + \delta S \in \mathcal{K}$ and the maximum value 
of each $\ell_t$ in $\mathcal{K}$ is bounded by $L$, then 

$$\|\tilde{g}_t(x)\| = \frac{d}{\delta} \big| \ell_t(x + \delta S) \big| \, \|S\| \le \frac{d L}{\delta} .$$

Note the inverse dependence on $\delta$. This dependence plays a key role in 
the final bound, as the next result shows. 



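Theorem 6.2 (Regret of OSGD with one-point feedback). 

Let $\mathcal{K} \subseteq \mathbb{R}^d$ be a closed convex set such that $r\mathbb{B} \subseteq \mathcal{K} \subseteq R\mathbb{B}$ for 
some $r, R > 0$. Let $\mathcal{L}$ be a set of $G$-Lipschitz differentiable and convex 
losses, uniformly bounded by $L$ (that is, $\|\ell\|_\infty \le L$ for all $\ell \in \mathcal{L}$). Fix $\delta > 0$ 
and assume OSGD is run on $\left(1 - \frac{\delta}{r}\right)\mathcal{K}$ with learning rate $\eta > 0$ and 
gradient estimates (6.3), 

$$\tilde{g}_t(x_t) = \frac{d}{\delta} \, \ell_t(X_t) \, S_t ,$$

where $X_t = x_t + \delta S_t$ and $S_1, S_2, \dots \in \mathbb{S}$ are independent. Then the 
following holds 

$$\overline{R}_n \le \frac{R^2}{\eta} + \eta \frac{(d L)^2}{\delta^2} n + \delta \left( 3 + \frac{R}{r} \right) G n .$$

Moreover, if 

$$\delta = \frac{1}{(2n)^{1/4}} \sqrt{\frac{R d L}{\left( 3 + \frac{R}{r} \right) G}} \qquad \text{and} \qquad \eta = \frac{1}{(2n)^{3/4}} \sqrt{\frac{R^3}{d L \left( 3 + \frac{R}{r} \right) G}}$$

then 

$$\overline{R}_n \le 4 n^{3/4} \sqrt{R d L \left( 3 + \frac{R}{r} \right) G} .$$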
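The effect of the tuning can be checked numerically. The sketch below assumes the bound has the three-term form $\frac{R^2}{\eta} + \eta \frac{(dL)^2}{\delta^2} n + \delta (3 + \frac{R}{r}) G n$ and that $\delta$ and $\eta$ are chosen as $(2n)^{-1/4}\sqrt{RdL/((3 + R/r)G)}$ and $(2n)^{-3/4}\sqrt{R^3/(dL(3 + R/r)G)}$; the function names `one_point_bound` and `tuned_parameters` are illustrative only.

```python
import math


def one_point_bound(n, d, R, r, G, L, eta, delta):
    """Evaluate R^2/eta + eta*(d*L/delta)^2*n + delta*(3 + R/r)*G*n."""
    c = 3.0 + R / r
    return R ** 2 / eta + eta * (d * L / delta) ** 2 * n + delta * c * G * n


def tuned_parameters(n, d, R, r, G, L):
    """Return (eta, delta) for the assumed tuning of the one-point bound."""
    c = 3.0 + R / r
    delta = (2 * n) ** -0.25 * math.sqrt(R * d * L / (c * G))
    eta = (2 * n) ** -0.75 * math.sqrt(R ** 3 / (d * L * c * G))
    return eta, delta


def tuned_bound(n, d, R, r, G, L):
    """Value of the bound at the tuned parameters."""
    eta, delta = tuned_parameters(n, d, R, r, G, L)
    return one_point_bound(n, d, R, r, G, L, eta, delta)
```

With these choices the three terms equal $(2n)^{3/4} A$, $\frac{1}{2}(2n)^{3/4} A$ and $\frac{1}{2}(2n)^{3/4} A$, where $A = \sqrt{R d L (3 + R/r) G}$, so the total is $2 \cdot 2^{3/4} n^{3/4} A \approx 3.36 \, n^{3/4} A$, below the stated $4 n^{3/4} A$.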



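Proof. The proof follows along the same lines as the proof of 
Theorem 6.1. Indeed, we can show that the points $X_t = x_t + \delta S_t$ on which $\ell_t$ 
is queried belong to $\mathcal{K}$. Then, using an easy modification of Lemma 6.3 
we get that for all $x \in \mathcal{K}$, 

$$\sum_{t=1}^{n} \mathbb{E}\big( \ell_t(X_t) \mid X_t^+, X_t^- \big) - \sum_{t=1}^{n} \ell_t(x) \le \sum_{t=1}^{n} \hat{\ell}_t(x_t) - \sum_{t=1}^{n} \hat{\ell}_t\left( \left( 1 - \frac{\delta}{r} \right) x \right) + \delta \left( 3 + \frac{R}{r} \right) G n .$$

Applying Theorem 5.5 as in the proof of Theorem 6.1 gives 

$$\mathbb{E} \sum_{t=1}^{n} \hat{\ell}_t(x_t) - \sum_{t=1}^{n} \hat{\ell}_t\left( \left( 1 - \frac{\delta}{r} \right) x \right) \le \frac{R^2}{\eta} + \eta \sum_{t=1}^{n} \mathbb{E} \|\tilde{g}_t(x_t)\|^2 \le \frac{R^2}{\eta} + \eta \frac{(d L)^2}{\delta^2} n .$$

□ 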



6.3 Nonlinear stochastic bandits 

We conclude with a simple example of nonlinear bandits in the 
stochastic setting. Unlike the gain-based analysis of stochastic bandits of 
Chapter 2, here we keep to the convention used throughout this chapter 
and work exclusively with losses. 

We consider a simple unidimensional setting where arms are points 
in the unit interval $[0, 1]$. If at time $t$ a point $x_t \in [0, 1]$ is played, the 
loss is the realization of an independent random variable $Y_t \in [0, 1]$ 
with expected value $\mathbb{E}[Y_t \mid x_t] = \mu(x_t)$, where $\mu : [0, 1] \to [0, 1]$ is a 
fixed but unknown mean loss function. Similarly to Chapter 2, here 
the pseudo-regret after $n$ plays of a given strategy can be rewritten as 

$$\overline{R}_n = \sum_{t=1}^{n} \mu(x_t) - n \min_{x \in [0, 1]} \mu(x)$$

where $x_1, \dots, x_n \in [0, 1]$ denote the points played by the strategy. 

Throughout this section, we assume that $\mu : [0, 1] \to [0, 1]$ is 
unimodal, but not necessarily convex. This means there exists a unique 
$x^* = \operatorname{argmin}_{x \in [0, 1]} \mu(x)$ such that $\mu(x)$ is monotone decreasing for 
$x \in [0, x^*]$ and monotone increasing for $x \in [x^*, 1]$. For example, if 
$\mu$ can be written as $\mu(x) = x f(x)$ where $f : [0, 1] \to [0, 1]$ is 
differentiable, monotone decreasing, and such that $x f'(x)$ is strictly decreasing 
with $f(0) > 0$, then $\mu$ is unimodal. 

The bandit strategy we analyze in this section is based on the golden 
section search due to Kiefer [1953], which is a general algorithm for 
finding the extremum of a unimodal function. Similarly to binary search, 
each step of golden section search narrows the interval in which the 
extremum is found by querying the function value at certain points that 
are chosen depending on the outcome of previous queries. Each query 
shrinks the interval by a factor of $\frac{1}{\varphi} = 0.618\ldots$, where $\varphi = \frac{1}{2}\left(1 + \sqrt{5}\right)$ 
is the golden ratio. 

In our case, queries (i.e., plays) at $x$ return a perturbed version of 
$\mu(x)$. Since $\mu$ is bounded, Hoeffding bounds ensure that we can find the 
minimum of $\mu$ by repeatedly querying each point $x$ requested by the 
golden search algorithm. However, in order to have a lower bound on 
the accuracy with which each value $\mu(x)$ needs to be estimated, we must assume 
the following condition: there exists $C_L > 0$ such that 

$$|\mu(x) - \mu(x')| \ge C_L |x - x'| \tag{6.4}$$

for each $x, x'$ that belong either to $[0, x^* - 1/C_L]$ or to $[x^* + 1/C_L, 1]$. 

Finally, irrespective of the uncertainty in the evaluation of $\mu$, in 
order to bound the regret incurred by golden section search we need 
a Lipschitz condition on $\mu$. Namely, there exists $C_H > 0$ such that 
$|\mu(x) - \mu(x')| \le C_H |x - x'|$ for all $x, x' \in [0, 1]$. 

We are now ready to introduce our stochastic version of the golden 
section search algorithm. 

SGS (Stochastic Golden Search): 

Parameters: $\varepsilon_1, \varepsilon_2, \dots > 0$. 

Initialize: $x_A = 0$, $x_B = \frac{1}{\varphi^2}$, $x_C = 1$. 

For each stage $s = 1, \dots, n$ 

(1) Let 

$$x'_B = \begin{cases} x_B - \frac{1}{\varphi^2} (x_B - x_A) & \text{if } x_B - x_A > x_C - x_B , \\ x_B + \frac{1}{\varphi^2} (x_C - x_B) & \text{otherwise} \end{cases}$$

and rename points $x_B, x'_B$ so that $x_A < x_B < x'_B < x_C$. 

(2) Play each point in $\{x_A, x_B, x'_B, x_C\}$ for $\frac{2}{\varepsilon_s^2} \ln(6n)$ times 
and let $\hat{x}_s$ be the point with lowest total loss in this stage. 

(3) If $\hat{x}_s \in \{x_A, x_B\}$ then eliminate interval $(x'_B, x_C]$ and let 
$x_C = x'_B$, 

(4) else eliminate interval $[x_A, x_B)$ and let $x_A = x_B$. 
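The stage structure of SGS can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' implementation: `play` is a hypothetical callable returning one (possibly noisy) loss for the queried point, and `plays_per_point` is a hypothetical callable giving the number of repetitions in stage `s` (for instance a count of order $\ln(6n)/\varepsilon_s^2$). The middle point is tracked explicitly so that the renaming in step (1) is unambiguous.

```python
import math

PHI = (1.0 + math.sqrt(5.0)) / 2.0  # golden ratio


def stochastic_golden_search(play, n_stages, plays_per_point):
    """Run n_stages stages of golden search; return the final search interval."""
    x_a, x_b, x_c = 0.0, PHI ** -2, 1.0
    for s in range(1, n_stages + 1):
        # Step (1): place the new point inside the larger of the two sub-intervals.
        if x_b - x_a > x_c - x_b:
            x_new = x_b - PHI ** -2 * (x_b - x_a)
        else:
            x_new = x_b + PHI ** -2 * (x_c - x_b)
        lo, hi = min(x_b, x_new), max(x_b, x_new)  # renaming: x_a < lo < hi < x_c
        # Step (2): play each of the four points and accumulate its total loss.
        m = plays_per_point(s)
        totals = {x: sum(play(x) for _ in range(m)) for x in (x_a, lo, hi, x_c)}
        best = min(totals, key=totals.get)
        # Steps (3)-(4): eliminate the sub-interval farthest from the best point.
        if best in (x_a, lo):
            x_b, x_c = lo, hi      # drop (hi, x_c]; lo becomes the middle point
        else:
            x_a, x_b = lo, hi      # drop [x_a, lo); hi becomes the middle point
    return x_a, x_c
```

With a noise-free `play` and a unimodal loss, the returned interval always contains the minimizer and has length $\varphi^{-s}$ after $s$ stages; with noisy plays, the repetition count controls the probability of eliminating the wrong side.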



Recall that golden section search proceeds as follows: given three 
queried points $x_A < x_B < x_C$ where the distance of $x_B$ to the other 
two points is in the golden ratio ($x_B$ might be closer to $x_A$ or to $x_C$ 
depending on past queries), the next point $x'_B$ is queried in the largest 
interval between $x_B - x_A$ and $x_C - x_B$ so that the distance of $x'_B$ to 
the extrema of that largest interval is in the golden ratio. Assume the 
resulting ordering is $x_A < x_B < x'_B < x_C$. Then we drop either $[x_A, x_B)$ 
or $(x'_B, x_C]$ according to whether the smallest value of $\mu$ is found in, 
respectively, $\{x'_B, x_C\}$ or $\{x_A, x_B\}$. The remaining triplet is such that 
the distance of the middle point to the other two is again in the golden 
ratio. 

Using elementary algebraic identities for $\varphi$, one can show that 
setting $x_C - x_A = 1$ the following equalities hold at any step of SGS: 

$$x_B - x_A = \frac{1}{\varphi^2} , \qquad x'_B - x_B = \frac{1}{\varphi^3} , \qquad x_C - x'_B = \frac{1}{\varphi^2} . \tag{6.5}$$

Since either $x_B - x_A$ or $x_C - x'_B$ is eliminated at each stage, at each 
stage SGS shrinks the search interval by a factor of $1 - \varphi^{-2} = \frac{1}{\varphi}$. 

Let $[x_{A,s}, x_{C,s}]$ be the search interval at the beginning of stage $s + 1$, 
where $x_{A,0} = 0$ and $x_{C,0} = 1$. 

Lemma 6.4. If $\varepsilon_s = C_L \varphi^{-(s+3)}$ then 

$$\mathbb{P}\big( x^* \notin [x_{A,s}, x_{C,s}] \big) \le \frac{s}{n}$$

holds uniformly over all stages $s \ge 1$. 



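Proof. Once the interval containing $x^*$ is eliminated it is never 
recovered, thus we have 

$$\mathbb{P}\big( x^* \notin [x_{A,s}, x_{C,s}] \big) \le \mathbb{P}\big( x^* \notin [x_{A,s-1}, x_{C,s-1}] \big) + \mathbb{P}\big( x^* \notin [x_{A,s}, x_{C,s}] \,\big|\, x^* \in [x_{A,s-1}, x_{C,s-1}] \big) . \tag{6.6}$$

Let $X_s = \{x_{A,s-1}, x_{B,s-1}, x'_{B,s-1}, x_{C,s-1}\}$ where $x_{B,s-1} < x'_{B,s-1}$ are 
computed in step 1 of stage $s$. Let $\hat{\mu}_s(x)$ be the sample loss of point 
$x \in X_s$ in stage $s$ and let $\hat{x}_s = \operatorname{argmin}_{x \in X_s} \hat{\mu}_s(x)$. Since at stage $s$ every 
point in $X_s$ is played $\frac{2}{\varepsilon_s^2} \ln(6n)$ times¹, Hoeffding bounds imply that, 
with high probability, $|\mu(x) - \hat{\mu}_s(x)| \le \frac{1}{2} \varepsilon_s$ for all $x \in X_s$ 
simultaneously. Let 

$$x^*_s = \operatorname{argmin}_{x \in X_s} \mu(x) .$$

Now assume $x^* \in [x_{A,s-1}, x_{B,s-1}]$. Then $x^* \notin [x_{A,s}, x_{C,s}]$ implies 
$\hat{\mu}_s(x'_{B,s-1}) < \hat{\mu}_s(x_{B,s-1})$ or $\hat{\mu}_s(x_{C,s-1}) < \hat{\mu}_s(x_{B,s-1})$. Similarly, assume 
$x^* \in [x'_{B,s-1}, x_{C,s-1}]$. Then $x^* \notin [x_{A,s}, x_{C,s}]$ implies $\hat{\mu}_s(x_{A,s-1}) < \hat{\mu}_s(x'_{B,s-1})$ 
or $\hat{\mu}_s(x_{B,s-1}) < \hat{\mu}_s(x'_{B,s-1})$. In both cases, we need to 
compare three values of $\mu$ on the same side with respect to $x^*$. (When 
$x^* \in [x_{B,s-1}, x'_{B,s-1}]$ we always have $x^* \in [x_{A,s}, x_{C,s}]$.) Hence, we can 
apply our assumption involving $C_L$. More precisely, (6.5) implies that 
after $s$ stages the search interval has size $\varphi^{-s}$ and 
$\min\{x_{B,s} - x_{A,s}, \; x'_{B,s} - x_{B,s}, \; x_{C,s} - x'_{B,s}\} = \varphi^{-(s+3)}$. Hence, introducing 

$$\Delta_s = \min\big\{ |\mu(x_{B,s}) - \mu(x_{A,s})| , \; |\mu(x'_{B,s}) - \mu(x_{B,s})| , \; |\mu(x_{C,s}) - \mu(x'_{B,s})| \big\}$$

we have 

$$\Delta_s \ge C_L \min\{x_{B,s} - x_{A,s}, \; x'_{B,s} - x_{B,s}, \; x_{C,s} - x'_{B,s}\} \ge C_L \varphi^{-(s+3)} = \varepsilon_s .$$

¹ For simplicity, we assume these numbers are integers. 

Let $T_s = \frac{8}{\varepsilon_s^2} \ln(6n)$ be the length of stage $s$. We can write 

$$\begin{aligned}
\mathbb{P}\big( x^* \notin [x_{A,s}, x_{C,s}] \,\big|\, x^* \in [x_{A,s-1}, x_{C,s-1}] \big) &= \mathbb{P}\big( \hat{\mu}_s(\hat{x}_s) < \hat{\mu}_s(x^*_s) \big) \\
&\le \sum_{x \in X_s \setminus \{x^*_s\}} \mathbb{P}\big( \hat{\mu}_s(x) < \hat{\mu}_s(x^*_s) \big) \\
&\le \sum_{x \in X_s \setminus \{x^*_s\}} \left( \mathbb{P}\left( \hat{\mu}_s(x) < \mu(x) - \frac{\Delta_s}{2} \right) + \mathbb{P}\left( \hat{\mu}_s(x^*_s) > \mu(x^*_s) + \frac{\Delta_s}{2} \right) \right) \\
&\le 6 e^{-T_s \Delta_s^2 / 8} \\
&\le 6 e^{-T_s \varepsilon_s^2 / 8} \le \frac{1}{n} . 
\end{aligned}$$

Substituting this in (6.6) and recurring gives the desired result. □ 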
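The arithmetic behind the last inequality can be made explicit. The sketch below assumes, as a reading of the stage schedule, that each of the four points is played $\frac{2}{\varepsilon_s^2} \ln(6n)$ times, so that the stage length is $T_s = \frac{8}{\varepsilon_s^2} \ln(6n)$; these constants and the function names are assumptions made for illustration.

```python
import math

PHI = (1.0 + math.sqrt(5.0)) / 2.0  # golden ratio


def accuracy(s, c_l):
    """Accuracy parameter eps_s = C_L * phi^-(s+3)."""
    return c_l * PHI ** -(s + 3)


def plays_per_point(s, n, c_l):
    """Assumed number of plays of each point in stage s (not rounded to an integer)."""
    return 2.0 * math.log(6 * n) / accuracy(s, c_l) ** 2


def stage_failure_bound(s, n, c_l):
    """Evaluate 6 * exp(-T_s * eps_s^2 / 8) with T_s equal to four times the per-point count."""
    t_s = 4.0 * plays_per_point(s, n, c_l)
    return 6.0 * math.exp(-t_s * accuracy(s, c_l) ** 2 / 8.0)
```

Under these assumptions the exponent equals $-\ln(6n)$ for every stage, so the bound evaluates to exactly $1/n$ whatever the values of $s$ and $C_L$.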



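Theorem 6.3 (Regret of SGS). For any unimodal and $C_H$-Lipschitz 
mean loss function $\mu : [0, 1] \to [0, 1]$ that satisfies (6.4), if the 
SGS algorithm is run with $\varepsilon_s = C_L \varphi^{-(s+3)}$ then 

$$\overline{R}_n \le \frac{C_H}{C_L^2} \, 8 \varphi^6 \ln(6n) \left[ \frac{2\varphi}{\varphi - 1} \sqrt{1 + C_L^2 n} + \frac{1}{4} \log_\varphi^2\big( 1 + C_L^2 n \big) \right] .$$

Proof. We start by decomposing the pseudo-regret as follows, 

$$\overline{R}_n \le \sum_{s=1}^{S} T_s \left( \min_{x \in X_s} \mu(x) - \mu(x^*) \right) + \sum_{s=1}^{S} \left( \sum_{t \in \mathcal{T}_s} \mu(x_t) - T_s \min_{x \in X_s} \mu(x) \right) ,$$

where $S$ is the number of stages and $\mathcal{T}_s$ is the set of time steps in stage $s$. 
Using the Lipschitz assumption 

$$\max_{x, x' \in [x_{A,s}, x_{C,s}]} |\mu(x) - \mu(x')| \le C_H |x_{C,s} - x_{A,s}|$$

and recalling that $|x_{C,s} - x_{A,s}| \le \varphi^{-s}$, we bound the first term in this 
decomposition as follows 

$$\sum_{s=1}^{S} T_s \left( \min_{x \in X_s} \mu(x) - \mu(x^*) \right) \le \sum_{s=1}^{S} \Big( T_s C_H |x_{C,s} - x_{A,s}| \, \mathbb{P}\big( x^* \in [x_{A,s}, x_{C,s}] \big) + T_s C_H \, \mathbb{P}\big( x^* \notin [x_{A,s}, x_{C,s}] \big) \Big) .$$

The second term is controlled similarly, 

$$\sum_{s=1}^{S} \left( \sum_{t \in \mathcal{T}_s} \mu(x_t) - T_s \min_{x \in X_s} \mu(x) \right) \le \sum_{s=1}^{S} T_s C_H |x_{C,s} - x_{A,s}| \le C_H \sum_{s=1}^{S} \frac{T_s}{\varphi^s} .$$

Hence we get an easy expression for the regret, 

$$\overline{R}_n \le C_H \sum_{s=1}^{S} T_s \left( \frac{2}{\varphi^s} + \frac{s}{n} \right) \le \frac{C_H}{C_L^2} \, 8 \varphi^6 \ln(6n) \sum_{s=1}^{S} \varphi^{2s} \left( \frac{2}{\varphi^s} + \frac{s}{n} \right) . \tag{6.7}$$

We now compute an upper bound on the number $S$ of stages, 

$$\frac{8 \varphi^6}{C_L^2} \ln(6n) \sum_{s=1}^{S} \varphi^{2s} = \sum_{s=1}^{S} T_s \le n .$$

Solving for $S$ and overapproximating we get 

$$S \le \frac{1}{2} \log_\varphi\big( 1 + C_L^2 n \big) .$$

Therefore, the sum in (6.7) is bounded as follows 

$$\sum_{s=1}^{S} \varphi^{2s} \left( \frac{2}{\varphi^s} + \frac{s}{n} \right) \le 2 \sum_{s=1}^{S} \varphi^s + S^2 \le \frac{2 \varphi^{S+1}}{\varphi - 1} + S^2 \le \frac{2\varphi}{\varphi - 1} \sqrt{1 + C_L^2 n} + \frac{1}{4} \log_\varphi^2\big( 1 + C_L^2 n \big) .$$

Substituting the above in (6.7) concludes the proof. □ 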

6.4 Bibliographic remarks 

Gradient-free methods for stochastic approximation have been studied 
for a long time — see the bibliographic remarks at the end of Chap- 
ter [5] for some references. However, relatively few results provide regret 
bounds. The approach presented in this chapter for online convex opti- 
miza tion with bandit feedback was pioneered by Flaxman et all 12005 ] 
and [Kleinberg L 20041. The improved rate for the two-points model was 
later shown in lAgarwal et al" 2010b ]. 
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While the results for nonlinear bandits in the adversarial model 
are still scarse, there is a far richer body of work in the stochastic 
model. The result based on golden sectio n search presented in Sec- 



tion 



is due to 



Yu and Mannorl . 120111 ] . It represents only a tiny 



portion of the known results in the stochastic model. In the general 
case of Lipschitz mean-payoff on a compact subset of M. d , it can be 
shown that the minimax regret is Q(n d + 2 \ . Thus the rate rapidly dete- 
riorates as the dimension increases, a phen omenon known a s the c urse 
of dimensiona l ity. Ho wever it was shown in Kleinberg et all [2008] and 



Bubeck et all l2009bl ] that under a generalized version of equation (JIT 



it is possible to circumvent the curse of dimensionality and obtain a 
regret of 0(y/n). This result builds upon and generalizes a sequence of 
works t hat include the d iscre tization approach (for the 1-dimensional 
cas e) of Kleinbergl . 2004 ] and Auer et al. . 2007 ]. as well as the method 
of |Copel . 120091 ] based on the Kiefer-Wolfowitz procedure (a classical 
method of st ochastic optimiz a tion) . The key new algorithmic id ea in- 



troduced in [Kleinberg et all 120081 ] and [Bubeck et all l2009bl ] is to 
adaptively partition the set of actions in order to exploit the smooth- 
ness of t he mean-p ayof f f uncti on around its maximum. We refer the 
reader to iBubeck et al .1 2011cf | for the details of this result (which is 
much more general than what we briefly outlined, in particular it ap- 
plies for metric spaces, or even more general action sets), as well as a 
more precise historical account. 

Another di rection for non l inear s tochastic bandits was recently in- 



vestigated in Agarwal et all l2011bl ]. In this work the authors study 



the case of a convex mean loss function, and they show how to com- 
bine the zeroth-order optimization method of Nemirovski and Yudinl . 



19831 ] with a "center point device" to obtain a regret of 0{y/n). 



A more general version of nonlinear stochastic bandits was also studied
in Amin et al. [2011]. In this paper the authors assume that the
mean-payoff function lies in some known set of functions $\mathcal{F}$.
They define a notion of complexity for the class $\mathcal{F}$, the
haystack dimension $\mathrm{HD}(\mathcal{F})$, and they show that the
worst case regret in $\mathcal{F}$ is lower bounded by
$\mathrm{HD}(\mathcal{F})$. Unfortunately their upper bound does not
match the lower bound, and the authors suggest that the definition of
the haystack dimension should be modified in order to obtain matching
upper and lower bounds.

Finally, a related problem in a Bayesian setting was studied in
Srinivas et al. [2010] and Grünewälder et al. [2010], where it is
assumed that the payoff functions are drawn from Gaussian processes.



7 Variants



In the previous chapters we explored a few fundamental variations 
around the basic multi-armed bandit problem. In both the stochastic 
and adversarial frameworks, these variants basically revolved around a 
single principle: by adding constraints on the losses (or rewards), it is 
possible to compete against larger sets of arms. While this is indeed a 
fundamental axis in the space of bandit problems, it is important to 
realize that there are many other directions. Indeed, we might sketch 
a "bandit space" spanning the following coordinates: 

• Evolution of payoffs over time: stochastic, adversarial, 
Markovian, . . . 

• Structure of payoff functions: linear, Lipschitz, Gaussian pro- 
cess, . . . 

• Feedback structure: full information, bandit, semi-bandit, 
partial monitoring, . . . 

• Context structure (if any). 

• Notion of regret. 

Clearly, such extensions greatly increase the number of potential
applications of bandit models. While many of these extensions were
already discussed in the previous chapters, in the following we focus on
others (such as the sleeping bandits or the truthful bandits) so as to
visit more exotic regions of the bandit space.



7.1 Markov Decision Processes, restless and sleeping bandits 



Extending further the model of Markovian bandits (mentioned at the
end of Chapter 1), one can also define a general Markov Decision
Process (MDP); see also Section 7.1. For example, the stochastic bandit
of Chapter 2 corresponds to a single-state MDP.

In full generality, a finite MDP can be described by a set of states
$\{1, \dots, S\}$, a set of actions $\{1, \dots, K\}$, a set
$\{p_{i,s}, 1 \le i \le K, 1 \le s \le S\}$ of transition distributions
over the states, and a set $\{\nu_{i,s}, 1 \le i \le K, 1 \le s \le S\}$
of reward distributions over $[0,1]$. In this model, taking action $i$
in state $s$ generates a stochastic reward drawn from $\nu_{i,s}$ and a
Markovian transition to a state drawn from $p_{i,s}$. Similarly to
the multi-armed bandit problem, here one typically assumes that the
reward distributions and transition distributions are unknown, and the
goal is to navigate through the MDP so as to maximize some function
of the obtained rewards. The field that studies this type of problem is
called Reinforcement Learning. The interested reader is referred to
Sutton and Barto [1998], Kakade [2003], Szepesvári [2010].
Reinforcement learning results with a flavor similar to those described
in the previous chapters can be found in Yu et al. [2009], Bubeck and
Munos [2010], Jaksch et al. [2010], Neu et al. [2010].
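
As a minimal sketch of the finite MDP model just described, the Python
snippet below samples a reward and a next state for a given state-action
pair. The class name `FiniteMDP`, the nested-list layout, and the choice
of Bernoulli rewards are assumptions made for illustration only; the
model itself merely requires reward distributions supported on [0,1].

```python
import random


class FiniteMDP:
    """Finite MDP with S states and K actions (illustrative sketch).

    transitions[i][s] is a list of S probabilities: the distribution p_{i,s}.
    reward_means[i][s] is the mean of a Bernoulli reward, standing in for
    the reward distribution nu_{i,s} on [0, 1] (an assumption of this sketch).
    """

    def __init__(self, transitions, reward_means, seed=0):
        self.transitions = transitions
        self.reward_means = reward_means
        self.num_actions = len(transitions)
        self.num_states = len(transitions[0])
        self.rng = random.Random(seed)

    def step(self, state, action):
        """Take `action` in `state` and return (reward, next_state)."""
        mean = self.reward_means[action][state]
        reward = 1.0 if self.rng.random() < mean else 0.0
        next_state = self.rng.choices(
            range(self.num_states), weights=self.transitions[action][state]
        )[0]
        return reward, next_state


# A stochastic bandit is the special case S = 1: every action loops back
# to the single state, so only the reward distributions matter.
bandit = FiniteMDP(
    transitions=[[[1.0]], [[1.0]]],  # K = 2 actions, S = 1 state
    reward_means=[[0.3], [0.7]],
)
reward, next_state = bandit.step(state=0, action=1)
```

With a single state the transition always returns state 0, which is the
sense in which the stochastic bandit corresponds to a single-state MDP.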

An intermediate model, between stochastic multi-armed bandits
and MDPs, is the one of restless bandits. As in Markovian bandits,
each arm is associated with a Markovian reward process with its own
state space. Each time an arm is chosen, the associated Markov process
generates an observable reward and makes a transition to a new
state, which is also observed. However, unlike Markovian bandits, an
unobserved transition occurs for each arm that is not chosen. Using
concentration inequalities for Markov chains (see, e.g., Lezaud
[1998]), one can basically show that, under suitable assumptions, UCB
attains a logarithmic regret for restless bandits as well (see Tekin
and Liu [2012] and Filippi et al. [2011]). A more general regret bound
for restless bandits has been recently proven by Ortner et al. [2012].
An apparently similar problem was studied by Garivier and Moulines
[2011], where they assume that the reward distributions can abruptly
change at unknown time instants (and there is a small number of such
change-points). Within this model, the authors prove that the best
possible regret is of order $\sqrt{n}$, which is matched by the Exp3.P
algorithm (see the discussion in Section 3.4.3). Thus, while the two
problems (restless bandits and bandits with change-points) might look
similar, they are fundamentally different. In particular, note that the
latter problem cannot be cast as an MDP.

Another intermediate model, with important applications, is
that of the sleeping bandits. There, it is assumed that the set
of available actions is varying over time. We refer the interested
reader to Kleinberg et al. [2010], Kanade et al. [2009], Slivkins
[2011], Kanade and Steinke [2012] for the details of this model as well
as the theoretical guarantees that can be obtained. A somewhat related
problem was also studied in György et al. [2007], where it is assumed
that the set of arms becomes unavailable for a random time after each
arm pull (and the distribution of this random time depends on the
selected arm).



7.2 Pure exploration problems 



The focus of bandits, and most of their variants, is on problems where 
there is a notion of cumulative rewards, which is to be maximized. 
This criterion leaves out a number of important applications where 
there is an online aspect (e.g., sequential decisions), but the goal is 
not maximizing cumulative rewards. The simplest example is perhaps 
the pure exploration version of stochastic bandits. In this model, at 
the end of round $n$ the algorithm has to output a recommendation $J_n$
which represents its estimate for the optimal arm. The focus here is on
the control of the so-called simple regret, introduced by Bubeck et al.
[2009a, 2011b] and defined as $r_n = \mu^* - \mathbb{E}\,\mu_{J_n}$.



Bubeck et al. [2009a] prove that minimizing the simple regret is
fundamentally different from minimizing the pseudo-regret
$\overline{R}_n$, in the sense that one always has
$r_n \ge \exp(-C \overline{R}_n)$ for some constant $C > 0$
(which depends on the reward distributions). Thus, this regret calls
for different bandit algorithms. Audibert et al. [2010] exhibit a simple
strategy with optimal performance up to a logarithmic factor. The
idea is very simple: the strategy SR (Successive Rejects) works in
$K - 1$ phases. SR keeps a set of active arms, which are sampled
uniformly in each phase. At the end of a phase, the arm with the
smallest empirical mean is removed from the set of active arms. It can
be shown that this strategy has a simple regret of order
$\exp\left(-c \frac{n}{H \ln K}\right)$, where
$H = \sum_{i \ne i^*} \frac{1}{\Delta_i^2}$ is the complexity measure of
identifying the best arm, and $c$ is a numerical constant. Moreover, a
matching lower bound (up to logarithmic factors) was also proved. These
ideas were extended in different ways by Gabillon et al. [2011], Bui et
al. [2011], Bubeck et al. [2012].
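
The phase structure of Successive Rejects can be sketched in a few lines
of Python. The phase lengths used below follow the schedule of Audibert
et al. [2010], in which each active arm has been pulled
$\lceil (n-K) / (\overline{\log}(K)(K+1-k)) \rceil$ times by the end of
phase $k$, with $\overline{\log}(K) = \frac12 + \sum_{i=2}^{K} \frac1i$.
The callback `pull`, the function name, and the Bernoulli arms in the
usage example are hypothetical and serve only as an illustration.

```python
import math
import random


def successive_rejects(pull, num_arms, budget):
    """Run Successive Rejects and return the recommended arm J_n.

    `pull(i)` is a hypothetical callback returning one reward from arm i.
    At most `budget` pulls are used in total.
    """
    if budget <= num_arms:
        raise ValueError("budget must exceed the number of arms")
    # log_bar(K) = 1/2 + sum_{i=2}^{K} 1/i normalizes the phase lengths.
    log_bar = 0.5 + sum(1.0 / i for i in range(2, num_arms + 1))
    active = list(range(num_arms))
    sums = [0.0] * num_arms
    pulls_per_arm = 0  # every active arm has been pulled this many times
    for phase in range(1, num_arms):  # K - 1 phases
        # Number of pulls each active arm must have reached after this phase.
        target = math.ceil(
            (budget - num_arms) / (log_bar * (num_arms + 1 - phase))
        )
        for arm in active:
            for _ in range(target - pulls_per_arm):
                sums[arm] += pull(arm)
        pulls_per_arm = target
        # Reject the active arm with the smallest empirical mean.
        worst = min(active, key=lambda a: sums[a] / pulls_per_arm)
        active.remove(worst)
    return active[0]


# Usage on three Bernoulli arms (means chosen arbitrarily for illustration).
means = [0.2, 0.5, 0.8]
rng = random.Random(0)
recommended = successive_rejects(
    lambda i: 1.0 if rng.random() < means[i] else 0.0, num_arms=3, budget=3000
)
```

Because all active arms share the same pull count within a phase, a
single counter suffices, and the arm surviving the last phase is the
recommendation.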



A similar problem was studied in a PAC model by Even-Dar et al.
[2002]. The goal is to find, with probability at least $1 - \delta$, an
arm with mean at least $\varepsilon$-close to the optimal mean, and the
relevant quantity is the number of pulls needed to achieve this goal.
For this problem, the authors derive an algorithm called Successive
Elimination that achieves an optimal number of pulls (up to logarithmic
factors). Successive Elimination works as follows: it keeps an estimate
of the mean of each arm, together with a confidence interval. If two
confidence intervals are disjoint, then the arm with the lowest
confidence interval is eliminated. Using this procedure, one can achieve
the $(\varepsilon, \delta)$ guarantee with a number of pulls of order
$H \ln\frac{K}{\delta}$. A matching lower bound is due to Mannor and
Tsitsiklis [2004], and further results are discussed by Even-Dar et al.
[2006].
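
The elimination rule can be illustrated with the following Python
sketch. It is not the exact procedure of Even-Dar et al. [2002]: the
confidence radius $\sqrt{\ln(4Kt^2/\delta)/(2t)}$ is a Hoeffding-type
choice assumed here for rewards in [0,1], the stopping rule (radius at
most $\varepsilon/2$) is likewise an assumption, and `pull` is a
hypothetical callback.

```python
import math


def successive_elimination(pull, num_arms, epsilon, delta):
    """Return (arm, rounds) for an (epsilon, delta)-style guarantee.

    `pull(i)` is a hypothetical callback returning a reward in [0, 1].
    Every active arm is pulled once per round, so all active arms share
    the same confidence radius.
    """
    active = list(range(num_arms))
    sums = [0.0] * num_arms
    t = 0
    while True:
        t += 1
        for arm in active:
            sums[arm] += pull(arm)
        # Hoeffding radius with a union bound over arms and rounds
        # (an assumed choice, not taken from the original paper).
        radius = math.sqrt(math.log(4 * num_arms * t * t / delta) / (2 * t))
        best_mean = max(sums[a] / t for a in active)
        # Keep only arms whose interval still overlaps the best interval.
        active = [
            a for a in active if sums[a] / t + radius >= best_mean - radius
        ]
        if len(active) == 1 or radius <= epsilon / 2:
            best = max(active, key=lambda a: sums[a] / t)
            return best, t


# Usage with deterministic rewards, only to show the calling convention.
arm, rounds = successive_elimination(
    lambda i: [0.2, 0.5, 0.8][i], num_arms=3, epsilon=0.1, delta=0.05
)
```

Pulling all active arms in lockstep keeps the bookkeeping simple; an arm
is dropped as soon as its interval lies entirely below the interval of
the empirically best arm.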



In some applications one is not interested in the best arm, but rather 
in having a good estimate of the mean for each arm. In this setting 
a reasonable measure of performance is given by 



$$L_n = \mathbb{E} \sum_{i=1}^{K} \left(\mu_i - \hat{\mu}_{i,T_i(n)}\right)^2.$$



Clearly, the optimal static allocation depends only on the variances
of the arms, and we denote by $L^*$ the performance of this strategy.
This setting was introduced by Antos et al. [2008], where the authors
studied the regret $L_n - L^*$, and showed that a regret of order
$n^{-3/2}$ was achievable. This result was then refined by Carpentier et
al. [2011], Carpentier and Munos [2011]. The basic idea in these papers
is to resort to the optimism in face of uncertainty principle, and to
approximate the optimal static allocation by replacing the true variance
with an upper confidence bound on it.
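
To make the notion of a static allocation concrete, the sketch below
assumes that the loss is the sum over arms of the mean squared error of
the empirical mean, so that pulling arm $i$ a fixed number $T_i$ of
times contributes $\sigma_i^2 / T_i$. Under that assumption, minimizing
$\sum_i \sigma_i^2 / T_i$ subject to $\sum_i T_i = n$ (ignoring integer
rounding) gives $T_i$ proportional to $\sigma_i$ and a loss of
$(\sum_i \sigma_i)^2 / n$. The function names and the numerical values
are hypothetical, and all variances are assumed to be positive.

```python
import math


def static_allocation(variances, n):
    """Continuous allocation minimizing sum_i variances[i] / T_i.

    The constraint is sum_i T_i = n; the minimizer is proportional to the
    standard deviations (assumes strictly positive variances).
    """
    stds = [math.sqrt(v) for v in variances]
    total = sum(stds)
    return [n * s / total for s in stds]


def allocation_loss(variances, pulls):
    """Sum over arms of the variance of the empirical mean."""
    return sum(v / t for v, t in zip(variances, pulls))


variances = [0.01, 0.04, 0.25]
n = 800
optimal_pulls = static_allocation(variances, n)
optimal_loss = allocation_loss(variances, optimal_pulls)
uniform_loss = allocation_loss(variances, [n / 3] * 3)
```

An adaptive strategy in the spirit of the papers cited above would call
`static_allocation` with upper confidence bounds on the variances in
place of the true (unknown) values.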



7.3 Dueling bandits 

An interesting variation of stochastic bandits was recently studied by
Yue et al. [2009]. The model considered in that paper is called dueling
bandits. The main idea is that the player has to choose a pair of arms
$(I_t, J_t)$ at each round, and can only observe the relative
performances of these two arms, i.e., the player only knows which arm
had the higher reward. More formally, in dueling bandits we assume that
there exists a total ordering $\succ$ on $\{1, \dots, K\}$ with the
following properties:

(1) If $i \succ j$, then the probability that the reward of arm $i$ is
larger than the reward of arm $j$ is equal to
$\frac12 + \Delta_{i,j}$ with $\Delta_{i,j} > 0$.

(2) If $i \succ j \succ k$, then
$\Delta_{i,j} + \Delta_{j,k} \ge \Delta_{i,k} \ge \max\{\Delta_{i,j}, \Delta_{j,k}\}$.

Upon selecting a pair $(I_t, J_t)$, the player receives a random variable
drawn from a Bernoulli distribution with parameter
$\frac12 + \Delta_{I_t,J_t}$. In this setting a natural regret notion is
the following quantity, where $i^*$ is the largest element in the
ordering $\succ$:

$$\sum_{t=1}^{n} \left(\Delta_{i^*,I_t} + \Delta_{i^*,J_t}\right).$$

It was proved in Yue et al. [2009] that the optimal regret for this
problem is of order $\frac{K}{\Delta} \log n$, where
$\Delta = \min_{i \ne j} \Delta_{i,j}$. A simple strategy that attains
this rate, based on the Successive Elimination algorithm of Even-Dar et
al. [2002], was proposed by Yue and Joachims [2011].
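
The feedback model and the regret notion of dueling bandits can be
simulated with the short Python sketch below. The gap values in `DELTA`
are hypothetical, and extending the gaps antisymmetrically (so that the
gap of the reversed pair is the negative of the original) is an
assumption of this sketch, made so that a duel is well defined for any
ordered pair.

```python
import random

# Hypothetical gaps for K = 3 arms ordered 0 > 1 > 2.
# DELTA[(i, j)] is given for i preferred to j.
DELTA = {(0, 1): 0.1, (1, 2): 0.1, (0, 2): 0.15}


def gap(i, j):
    """Gap of the pair (i, j), extended antisymmetrically (assumption)."""
    if i == j:
        return 0.0
    return DELTA[(i, j)] if (i, j) in DELTA else -DELTA[(j, i)]


def satisfies_property_two(order):
    """Check property (2) for every triple i > j > k of the ordering."""
    size = len(order)
    for a in range(size):
        for b in range(a + 1, size):
            for c in range(b + 1, size):
                i, j, k = order[a], order[b], order[c]
                upper = gap(i, j) + gap(j, k)
                lower = max(gap(i, j), gap(j, k))
                if not (upper >= gap(i, k) >= lower):
                    return False
    return True


def duel(i, j, rng):
    """Return 1 if arm i beats arm j, drawn from Bernoulli(1/2 + gap)."""
    return 1 if rng.random() < 0.5 + gap(i, j) else 0


def dueling_regret(pairs, best):
    """Sum over rounds of the gaps between the best arm and both played arms."""
    return sum(gap(best, i) + gap(best, j) for i, j in pairs)


rng = random.Random(0)
outcome = duel(0, 2, rng)
regret = dueling_regret([(0, 0), (0, 1), (1, 2)], best=0)
```

Playing the pair formed by the best arm against itself is the only
choice that incurs zero regret in a round, which is why the regret sums
the gaps of both selected arms.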



7.4 Discovery with probabilistic expert advice 



Bubeck et al. [2011a] study a model with a stochastic bandit flavor (in
fact it can be cast as an MDP), where the key for the analysis is a sort
of 'non-linear' regret bound. In this model rewards represent items in
some set $\mathcal{X}$ which is partitioned into a subset
$A \subset \mathcal{X}$ of interesting items and a subset
$\mathcal{X} \setminus A$ of non-interesting items. The goal is to
maximize the total expected number of interesting items found after $n$
pulls, where observing the same item twice does not help. A natural
notion of regret is obtained by comparing the number of interesting
items $F(n)$ found by a given strategy to the number $F^*(n)$ found by
the optimal strategy. It turns out that analyzing such regret directly
is difficult. The first issue is that in this problem the notion of a
"good" arm is dynamic, in the sense that an arm could be very good for a
few pulls and then completely useless. Furthermore, a strategy making
bad decisions in the beginning will have better opportunities in the
future than the optimal strategy (which already discovered some
interesting items). Taking into account these issues, it turns out that
it is easier to show that for good strategies, $F(n)$ is not too far
from $F^*(n')$, where $n'$ is not much smaller than $n$. Such a
statement, which can be interpreted as a non-linear regret bound, shows
that the analyzed strategy slightly 'lags' behind the optimal strategy.
In Bubeck et al. [2011a] a non-linear regret bound is derived for an
algorithm based on estimating the mass of interesting items left on each
arm (the so-called Good-Turing estimator), combined with the optimism in
face of uncertainty principle of Chapter 2. We refer the reader to
Bubeck et al. [2011a] for more precise statements.
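
For intuition about the Good-Turing idea, the following Python sketch
implements the classical missing-mass estimator for a single arm: the
fraction of draws that are items seen exactly once, restricted here to
interesting items. The function name and the predicate `is_interesting`
are hypothetical, and the exact estimator and confidence bonus used by
Bubeck et al. [2011a] are not reproduced.

```python
from collections import Counter


def good_turing_missing_mass(samples, is_interesting=lambda item: True):
    """Estimate the probability mass of interesting items not yet observed.

    Classical Good-Turing estimate: the number of interesting items that
    were observed exactly once, divided by the number of draws.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    counts = Counter(samples)
    singletons = sum(
        1 for item, c in counts.items() if c == 1 and is_interesting(item)
    )
    return singletons / len(samples)


draws = ["a", "b", "a", "c", "d", "a"]
mass_all = good_turing_missing_mass(draws)
mass_interesting = good_turing_missing_mass(draws, lambda x: x in {"a", "b", "c"})
```

An arm whose draws keep producing items never seen before has a large
estimated missing mass, which matches the dynamic notion of a "good" arm
described above: the estimate shrinks as the arm is exhausted.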



7.5 Many-armed bandits 



The many-armed bandit setting was introduced by Berry et al. [1997],
and then extended and refined by Wang et al. [2008]. This setting
corresponds to a stochastic bandit with an infinite number of arms. The
extra assumption that makes this problem feasible is prior knowledge
of the distribution of the arms. More precisely, when the player asks
to "add" a new arm to his current set of active arms, one assumes
that the probability that this arm is $\varepsilon$-optimal is of order
$\varepsilon^{\beta}$, for some known $\beta > 0$. Thus the player faces
a trade-off between exploitation, exploration, and discovery, where the
last component comes from the fact that the player needs to consider new
arms to make sure that he has an active $\varepsilon$-optimal arm. Using
a UCB strategy on the active arms, and by adding new arms at a rate
which depends on $\beta$, Wang et al. [2008] prove that a regret of
order

$$n^{\max\left\{\frac12, \frac{\beta}{1+\beta}\right\}}$$

is achievable in this setting.



7.6 Truthful bandits 



A popular application domain for bandit algorithms is ad placement on
the Web. In the pay-per-click model, for each incoming user
$t = 1, 2, \dots$ the publisher selects an advertiser $I_t$ from a pool
of $K$ advertisers, and displays the corresponding ad to the user. The
publisher then gets a reward if the ad is clicked by the user. This
problem is well modeled by the multi-armed bandit setting. However,
there is a fundamental aspect of the ad placement process which is
overlooked by this formulation. Indeed, prior to running an ad-selection
algorithm (i.e., a bandit algorithm), each advertiser
$i \in \{1, \dots, K\}$ issues a bid $b_i$. This number is how much $i$
is willing to pay for a click. Each bidder also keeps a private value
$v_i$, which is the true value $i$ assigns to a click. Because a
rational bidder ensures that $b_i \le v_i$, the difference $v_i - b_i$
defines the utility for bidder $i$. The basic idea of truthful bandits
is to construct a bandit algorithm such that each advertiser has no
incentive to submit a bid $b_i$ such that $b_i < v_i$. A natural
question to ask is whether this restriction to truthful algorithms
changes the dynamics of the multi-armed bandit problem. This was
investigated in a number of papers, including Babaioff et al. [2009],
Devanur and Kakade [2009], Babaioff et al. [2010], Wilkens and Sivan
[2012]. Truthful bandits are part of a more general thread of research
at the interface between bandits and Mechanism Design.



7.7 Concluding remarks 

As pointed out in the introduction, the growing interest in bandits
arises from the large number of industrially relevant problems that
can be modeled as a multi-armed bandit. In particular, the sequential
nature of the bandit setting makes it perfectly suited to various
Internet and Web applications. These include search engine optimization
with dueling bandits, or ad placement with contextual bandits and
truthful bandits, see the references in, respectively, Section 7.3,
Section 4.3, and Section 7.6.

Multi-armed bandits also proved to be very useful in other areas.
For example, thanks to the strong connections between bandits and
Markov Decision Processes, a breakthrough in Monte Carlo Tree Search
(MCTS) was achieved using bandit ideas. More precisely, based on
the sparse planning idea of Kearns et al. [2002], Kocsis and Szepesvári
[2006] introduced a new MCTS strategy called UCT (UCB applied to
Trees) that led to a substantial advancement in Computer Go
performance, see Gelly et al. [2006]. Note that, from a theoretical
point of view, UCT was proved to perform poorly by Coquelin and Munos
[2007], and a strategy based on a similar idea, but with improved
theoretical performance, was proposed by Bubeck and Munos [2010]. Other
applications in related directions have also been explored, see for
example Teytaud and Teytaud [2009], Hoock and Teytaud [2010] and many
others.

Many new domains of application for bandit problems are currently
being investigated. For example: multichannel opportunistic
communications [Liu et al., 2010], model selection [Agarwal et al.,
2011a], boosting [Busa-Fekete and Kégl, 2011], management of dark pools
of liquidity (a recent type of stock exchange) [Agarwal et al., 2010a],
security analysis of power systems [Bubeck et al., 2011a].



Given the fast pace of new variants, extensions, and applications 
coming out every week, we had to make tough decisions about what 
to present in this survey. We apologize for everything we had to leave 
out. On the other hand, we do hope that what we decided to put in 
will enthuse more researchers about entering this exciting field. 